SOME ASPECTS OF SYMMETRIC GAMMA PROCESS MIXTURES 


ZACHARIE NAULET AND ERIC BARAT 


Abstract. In this article, we present some specific aspects of symmetric Gamma process 
mixtures for use in regression models. We propose a new Gibbs sampler for simulating the 
posterior and we establish adaptive posterior rates of convergence related to the Gaussian 
mean regression problem. 


1. Introduction 

Recently, interest in a Bayesian nonparametric approach to the sparse regression problem 
based on mixtures emerged from the works of Abramovich et al. (2000), de Jonge and van Zanten 
(2010) and Wolpert et al. (2011). The idea is to model the regression function as 

\[ f(\cdot) = \int_{\mathcal{X}} K(x; \cdot)\, Q(dx), \qquad Q \sim \Pi^*, \tag{1} \]

where $K : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}$ is a jointly measurable kernel function, and $\Pi^*$ is a prior distribution 
on the space of signed measures over the measurable space $\mathcal{X}$. Although the model (1) is 
popular in density estimation Escobar and West (1994); Müller et al. (1996); Ghosal and 
van der Vaart (2007a); Shen et al. (2013); Canale and De Blasi (2013) and for modeling 
hazard rates in Bayesian nonparametric survival analysis Lo and Weng (1989); Peccati and 
Prünster (2008); De Blasi et al. (2009); Ishwaran and James (2012); Lijoi and Nipoti (2014), 
it seems that much less interest has been shown in regression. 

Perhaps the limited interest in mixture models for regression is due to the lack of variety in 
the available algorithms and to the scarcity of theoretical posterior contraction 
results. To our knowledge, the sole existing algorithm for posterior simulation is to be found 
in Wolpert et al. (2011), when the mixing measure $Q$ is a Lévy process. On the other hand, 
the only contraction result available is to be found in de Jonge and van Zanten (2010) for a 
suitable semiparametric mixing measure. 

Indeed, both designing an algorithm and establishing posterior contraction results depend 
heavily on the choice of $K$ and $\Pi^*$ in equation (1), but above all on the observation model 
we consider. This last point makes the study of mixtures in regression difficult to handle because 
of the diversity of possible observation models. In this article, we focus on the situation where 
$Q$ is a symmetric Gamma process, and we propose both a new algorithm for posterior simulation 
and posterior contraction rate results. 

In the first part of the paper, we propose a Gibbs sampler to get samples from the posterior 
distribution of symmetric Gamma process mixtures. The algorithm is sufficiently general to 
be used in all observation models for which the likelihood function is available. We begin with 
some preliminary theoretical results about approximating symmetric Gamma process mixtures, 
before stating the general algorithm. Finally, we make an empirical study of the algorithm, 
with a comparison with the RJMCMC algorithm of Wolpert et al. (2011). 

The second part of the paper is devoted to posterior contraction rate results. We consider 
the mean regression model with normal errors of unknown variance, and two types of mixture 
priors: location-scale and location-modulation. The latter has never been studied previously, 
mainly because it is irrelevant in density estimation models. However, we show here that 
it yields better rates of convergence than location-scale mixtures, and thus might be 
interesting to consider in regression. 


2. Symmetric Gamma process mixtures 

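Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $(\mathcal{X}, \mathcal{A})$ be a measurable space. We call a mapping 
$Q : \Omega \times \mathcal{A} \to \mathbb{R} \cup \{\pm\infty\}$ a signed random measure if $\omega \mapsto Q(\omega, A)$ is a random variable for 
each $A \in \mathcal{A}$ and if $A \mapsto Q(\omega, A)$ is a signed measure for each $\omega \in \Omega$. 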

Symmetric Gamma random measures are infinitely divisible and independently scattered 
random measures (the terminology Lévy base is also used in Barndorff-Nielsen and Schmiegel 
(2004), and Lévy random measure in Wolpert et al. (2011)), that is, random measures with 
the property that for all disjoint $A_1, \dots, A_k \in \mathcal{A}$, the random variables $Q(A_1), \dots, Q(A_k)$ 
are independent with infinitely divisible distributions. More precisely, given $\alpha, \eta > 0$ and $F$ a 
probability measure on $\mathcal{X}$, a symmetric Gamma random measure assigns to every measurable set 
$A \in \mathcal{A}$ a random variable with distribution $\mathrm{SGa}(\alpha F(A), \eta)$ (see appendix A). Existence and 
uniqueness of symmetric Gamma random measures are established in Rajput and Rosinski (1989). 

In the sequel, we shall always denote by $\Pi^*$ the distribution of a symmetric Gamma random 
measure with parameters $\alpha$, $\eta$ and $F$, and we refer to $\alpha F$ as the base distribution of $Q \sim \Pi^*$, 
and to $\eta$ as the scale parameter. 

2.1. Location-scale mixtures. Given a measurable mother function $g : \mathbb{R}^d \to \mathbb{R}$, we define 
the location-scale kernel $K_A(x) := g(A^{-1} x)$, for all $x \in \mathbb{R}^d$ and all $A \in \mathcal{E}$, where $\mathcal{E}$ denotes 
the set of all $d \times d$ positive definite real matrices. Then we consider symmetric Gamma 
location-scale mixtures of the type 

\[ f(x; \omega) := \int_{\mathcal{E} \times \mathbb{R}^d} K_A(x - \mu)\, Q(dA\, d\mu; \omega), \qquad \forall x \in \mathbb{R}^d, \tag{2} \]

where $Q : \mathcal{E} \times \mathbb{R}^d \times \Omega \to [-\infty, \infty]$ is a symmetric Gamma random measure with base measure 
$\alpha F$ on $\mathcal{E} \times \mathbb{R}^d$, and scale parameter $\eta > 0$. The precise meaning of the integral in equation (2) 
is made clear in Rajput and Rosinski (1989). 



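2.2. Location-modulation mixtures. As in the previous section, given a measurable mother 
function $g : \mathbb{R}^d \to \mathbb{R}$, we define the location-modulation kernel $K_{\xi,\phi}(x) := g(x) \cos\big(\sum_{i=1}^d \xi_i x_i + \phi\big)$, 
for all $x \in \mathbb{R}^d$, all $\xi \in \mathbb{R}^d$ and all $\phi \in [0, \pi/2]$. Then we consider symmetric Gamma 
location-modulation mixtures of the type 

\[ f(x; \omega) := \int_{\mathbb{R}^d \times \mathbb{R}^d \times [0, \pi/2]} K_{\xi,\phi}(x - \mu)\, Q(d\xi\, d\mu\, d\phi; \omega), \qquad \forall x \in \mathbb{R}^d, \tag{3} \]

where $Q : \mathbb{R}^d \times \mathbb{R}^d \times [0, \pi/2] \times \Omega \to [-\infty, \infty]$ is a symmetric Gamma random measure with 
base measure $\alpha F$ on $\mathbb{R}^d \times \mathbb{R}^d \times [0, \pi/2]$, and scale parameter $\eta > 0$. 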

2.3. Convergence of mixtures. Given a kernel $K : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}$ and a symmetric Gamma 
random measure $Q$, it is not clear a priori whether the mixture $y \mapsto \int K(x; y)\, Q(dx)$ 
converges, and in what sense. According to Rajput and Rosinski (1989) (see also 
Wolpert et al. (2011)), $y \mapsto \int K(x; y)\, Q(dx)$ converges almost surely at all $y$ for which 

\[ \int_{\mathbb{R} \times \mathcal{X}} \big(1 \wedge |u K(x; y)|\big)\, |u|^{-1} e^{-|u| \eta}\, du\, F(dx) < +\infty. \]

Moreover, from the same references (see also Kingman (1992)), if $\mathbb{M}$ is a complete normed 
space equipped with norm $\|\cdot\|$, then $y \mapsto \int K(x; y)\, Q(dx)$ converges almost surely in $\mathbb{M}$ if 

\[ \int_{\mathbb{R} \times \mathcal{X}} \big(1 \wedge |u|\, \|K(x; \cdot)\|\big)\, |u|^{-1} e^{-|u| \eta}\, du\, F(dx) < +\infty. \]

Since by definition $F$ is a probability measure, we have for instance that the mixtures of 
equations (2) and (3) converge almost surely in $L^\infty$ as soon as $\|K_A\|_\infty < +\infty$ for $F$-almost 
every $A \in \mathcal{E}$, or $\|K_{\xi,\phi}\|_\infty < +\infty$ for $F$-almost every $(\xi, \phi) \in \mathbb{R}^d \times [0, \pi/2]$. 

3. Simulating the posterior 

In this section we propose a Gibbs sampler for exploring the posterior distribution 
of a mixture of kernels by a symmetric Gamma random measure. The sampler is based on 
the series representation of the next theorem, inspired by a result about Dirichlet processes 
from Favaro et al. (2012), adapted to symmetric Gamma processes. In theorem 1, we consider 
$\mathcal{M}(\mathcal{X})$, the space of signed Radon measures on the measurable space $(\mathcal{X}, \mathcal{A})$. By the Riesz-Markov 
representation theorem (Rudin, 1974, Chapter 6), $\mathcal{M}(\mathcal{X})$ can be identified with the dual 
space of $C_c(\mathcal{X})$, the space of continuous functions with compact support. That said, we endow 
$\mathcal{M}(\mathcal{X})$ with the topology $\mathcal{T}_v$ of weak-* convergence (sometimes referred to as the topology of 
vague convergence), that is, a sequence $\{\mu_n \in \mathcal{M}(\mathcal{X}) : n \in \mathbb{N}\}$ converges to $\mu \in \mathcal{M}(\mathcal{X})$ with 
respect to the topology $\mathcal{T}_v$ if for all $f \in C_c(\mathcal{X})$, 

\[ \int_{\mathcal{X}} f(x)\, d\mu_n(x) \to \int_{\mathcal{X}} f(x)\, d\mu(x). \]

Since we deal with prior distributions on $\mathcal{M}(\mathcal{X})$, we need to equip $\mathcal{M}(\mathcal{X})$ with a $\sigma$-algebra. Throughout, 
we use the Borel $\sigma$-algebra of $\mathcal{M}(\mathcal{X})$ generated by $\mathcal{T}_v$. 

Before stating the main theorem of this section, we recall that a sequence of random 
variables $\{X_i \in \mathcal{X} : 1 \le i \le n\}$ is a Pólya urn sequence with base distribution $\alpha F(\cdot)$, where 
$F$ is a probability distribution on $(\mathcal{X}, \mathcal{A})$ and $\alpha > 0$, if for all measurable sets $A \in \mathcal{A}$, 

\[ \mathbb{P}(X_1 \in A) = F(A), \qquad \mathbb{P}(X_{k+1} \in A \mid X_1, \dots, X_k) = F_k(A)/F_k(\mathcal{X}), \quad k = 1, \dots, n-1, \]

where $F_k := \alpha F + \sum_{i=1}^k \delta_{X_i}$. We are now in a position to state the main theorem of this section, 
whose proof is given in appendix A. 

Theorem 1. Let $\mathcal{X}$ be a Polish space with Borel $\sigma$-algebra, $p > 0$ be an integer, $T \sim \mathrm{Ga}(\alpha, \eta)$, 
independently, $J_1, \dots, J_p \overset{\text{i.i.d.}}{\sim} \mathrm{SGa}(1, 1)$, and $\{X_i \in \mathcal{X} : 1 \le i \le p\}$ a Pólya urn sequence 
with base distribution $\alpha F(\cdot)$, independent of $T$ and of the $J_i$'s. Define the random measure 
$Q_p := \sqrt{T/p}\, \sum_{i=1}^p J_i \delta_{X_i}$. Then $Q_p \overset{d}{\to} Q$, where $Q$ is a symmetric Gamma random measure 
with base distribution $\alpha F(\cdot)$ and scale parameter $\sqrt{\eta}$. 
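
The construction in theorem 1 can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes that $\mathcal{X}$ is the real line, that $\mathrm{Ga}(\alpha, \eta)$ is parametrized by shape and rate (which matches the expectations quoted in section 4.1), and that $\mathrm{SGa}(1, 1)$ is the standard Laplace law (the actual definition is in appendix A, which is not reproduced here). The names `sample_particle_measure`, `evaluate_mixture` and `base_sampler` are hypothetical.

```python
import numpy as np

def sample_particle_measure(p, alpha, eta, base_sampler, rng):
    """Draw the atoms and signed weights of Q_p = sqrt(T/p) * sum_i J_i * delta_{X_i}."""
    # T ~ Ga(alpha, eta), with eta read as a rate (NumPy expects scale = 1 / rate).
    T = rng.gamma(shape=alpha, scale=1.0 / eta)
    # Assumption: SGa(1, 1) is taken to be the standard Laplace distribution.
    J = rng.laplace(loc=0.0, scale=1.0, size=p)
    # Polya urn with base alpha * F: given k previous draws, the next value is a
    # fresh draw from F with probability alpha / (alpha + k), otherwise a copy
    # of one of the k previous values chosen uniformly at random.
    X = np.empty(p)
    X[0] = base_sampler(rng)
    for k in range(1, p):
        if rng.uniform() < alpha / (alpha + k):
            X[k] = base_sampler(rng)
        else:
            X[k] = X[rng.integers(k)]
    return X, np.sqrt(T / p) * J

def evaluate_mixture(y, atoms, weights, kernel):
    """Evaluate f(y) = sum_i weights[i] * K(atoms[i]; y) on a grid y."""
    return weights @ kernel(atoms[:, None], y[None, :])
```

Because the urn produces many ties, the number of distinct atoms is typically much smaller than $p$, which is what the clustering variables of section 3.2 exploit.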

3.1. Convergence of sequences of mixtures. In theorem 1, we proved weak convergence of 
the sequence of approximating measures $(Q_p)_{p \ge 1}$ to the symmetric Gamma random measure, 
but it is not clear that mixtures of kernels by $Q_p$ also converge. The next proposition establishes 
convergence in $L^q$ for general kernels, with $1 \le q < +\infty$. The proof is similar to the proof 
of Favaro et al. (2012, Theorem 2), so we defer it to section 6.2. For any kernel $K : 
\mathcal{X} \times \mathbb{R}^d \to \mathbb{C}$, and any (signed) measure $Q$ on $(\mathcal{X}, \mathcal{A})$, we write 

\[ f^{(Q)}(y) := \int_{\mathcal{X}} K(x; y)\, Q(dx). \]

Proposition 1. If $x \mapsto K(x; y)$ is continuous for all $x \in \mathcal{X}$, vanishes outside a compact 
set, and is bounded by a Lebesgue integrable function $h$, then for any $1 \le q < +\infty$ we have 
$\lim_{p \to \infty} \|f^{(Q_p)} - f^{(Q)}\|_q = 0$ almost surely. 

Under supplementary assumptions on $K$, we can say a little more about uniform convergence 
of the approximating sequence of mixtures. Assuming that $y \mapsto K(x; y)$ is in $L^1$ for all 
$x \in \mathcal{X}$, we denote by $(x, u) \mapsto \hat{K}(x; u)$ the $L^1$ Fourier transform in the second argument of 
$(x, y) \mapsto K(x; y)$. 

Proposition 2. Let $y \mapsto K(x; y)$ be in $L^1$ for all $x \in \mathcal{X}$ and let $\hat{K}$ satisfy the assumptions of 
proposition 1. Then $\lim_{p \to \infty} \|f^{(Q_p)} - f^{(Q)}\|_\infty = 0$ almost surely. 

Proof. We can assume without loss of generality that $f^{(Q_p)}$ and $f^{(Q)}$ are defined on the 
same probability space $(\Omega, \mathcal{F}, \mathbb{P})$. By duality, it is clear that 

\[ \|f^{(Q_p)}(\cdot; \omega) - f^{(Q)}(\cdot; \omega)\|_\infty \le \int_{\mathbb{R}^d} |\hat{f}^{(Q_p)}(u; \omega) - \hat{f}^{(Q)}(u; \omega)|\, du, \]

where $\hat{f}$ denotes the $L^1$ Fourier transform of $f$. Notice that 
by the assumptions on $K$, $\hat{f}^{(Q_p)}$ and $\hat{f}^{(Q)}$ are well-defined for almost all $\omega \in \Omega$ (see section 2.3). 
Then by Fubini's theorem 

\[ \hat{f}^{(Q_p)}(u; \omega) = \int_{\mathbb{R}^d} \int_{\mathcal{X}} K(x; y)\, Q_p(dx; \omega)\, e^{-iuy}\, dy = \int_{\mathcal{X}} \int_{\mathbb{R}^d} K(x; y)\, e^{-iuy}\, dy\, Q_p(dx; \omega) = \int_{\mathcal{X}} \hat{K}(x; u)\, Q_p(dx; \omega), \]

and the conclusion follows from proposition 1. □ 



3.2. General algorithm. From theorem 1, replacing $Q$ by $Q_p$ for sufficiently large $p$, we 
propose a Pólya urn Gibbs sampler adapted from algorithm 8 in Neal (2000). In the sequel, 
we refer to $Q_p$ as the particle approximation of $Q$ with $p$ particles. 

Let $Y = (Y_i)_{i=1}^n$ be observations coming from a statistical model with likelihood function 
$\mathcal{L}(f \mid Y)$, where $f : \mathbb{R}^d \to \mathbb{R}$ is the regression function on which we put a symmetric Gamma 
mixture prior distribution. Let $X = (X_i)_{i=1}^p$ be a Pólya urn sequence, $J := (J_1, \dots, J_p)$ a 
sequence of i.i.d. $\mathrm{SGa}(1, 1)$ random variables, and $T \sim \mathrm{Ga}(\alpha, \eta)$ independent of $(X_i)_{i=1}^p$ and 
$J$. We introduce the clustering variables $C := (C_1, \dots, C_p)$ such that $C_i = k$ if and only if 
$X_i = X_k^*$, where $X^* := (X_1^*, \dots)$ stands for the unique values of $(X_i)_{i=1}^p$. In the sequel, $C_{-i}$ stands 
for the vector obtained by removing the coordinate $i$ from $C$, and the same definition holds for 
$J$ mutatis mutandis. Given $J$, $C$, $X$, $T$ and a measurable kernel $K : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}$ we construct 
$f$ as 

\[ f(x) = \sqrt{\frac{T}{p}}\, \sum_{i=1}^p J_i K(X_i; x). \]

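We propose the following algorithm. At each iteration, successively sample from: 

(1) $C_i \mid C_{-i}, Y, X, J, T$, for $1 \le i \le p$. Let $n_k^{(-i)} := \#\{l \ne i : C_l = k\}$, let $K^{(-i)}$ be the number of 
distinct values of $(X_l)_{l \ne i}$, and let $\kappa_0$ be a chosen natural number. Then 

\[ \mathbb{P}(C_i = \cdot \mid C_{-i}, Y, X, J, T) \propto \sum_{k=1}^{K^{(-i)}} n_k^{(-i)}\, \mathcal{L}_{k,i}(X, J, T \mid Y)\, \delta_k(\cdot) + \frac{\alpha}{\kappa_0} \sum_{k=1}^{\kappa_0} \mathcal{L}_{k + K^{(-i)}, i}(X, J, T \mid Y)\, \delta_{k + K^{(-i)}}(\cdot), \]

where $\mathcal{L}_{k,i}(X, J, T \mid Y)$ stands for the likelihood under the hypothesis that particle $i$ is 
allocated to component $k$ (note that the likelihood evaluation requires the knowledge 
of the whole distribution $F$ under any allocation hypothesis). 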

(2) $X \mid C, Y, J, T$. Random Walk Metropolis Hastings on parameters. 

(3) $J_i \mid J_{-i}, C, Y, X, T$, for $1 \le i \le p$. Independent Metropolis Hastings with prior $\mathrm{SGa}(1, 1)$ 
taken as i.i.d. candidate distribution for $J_i$. Note that for $p \to \infty$, the posterior 
distribution of $J_i \mid J_{-i}, C, Y, X, T$ should be $\mathrm{SGa}(1, 1)$, so the number of particles $p$ may 
be monitored using the acceptance ratio of the $J_i$'s. 

(4) $T \mid C, Y, X, J$. Random Walk Metropolis Hastings on scale parameter. 
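
Step (3) can be illustrated with a short sketch. When the candidate is drawn from the prior, the Metropolis-Hastings acceptance probability reduces to the likelihood ratio, which is what the code below uses. This is only an illustration under stated assumptions: a Gaussian mean regression likelihood with known noise level `sigma`, a precomputed matrix `K_mat` with entries $K(X_i; x_j)$, and $\mathrm{SGa}(1, 1)$ taken to be the standard Laplace law. The function names are hypothetical.

```python
import numpy as np

def log_lik(y, f_vals, sigma):
    """Gaussian log-likelihood, up to an additive constant that cancels in ratios."""
    return -0.5 * np.sum((y - f_vals) ** 2) / sigma**2

def update_weights(J, K_mat, T, y, sigma, rng):
    """One sweep of independence Metropolis-Hastings over J_1, ..., J_p.

    K_mat[i, j] = K(X_i; x_j). The proposal is the prior, so the acceptance
    probability is min(1, likelihood ratio). J is updated in place.
    """
    p = len(J)
    scale = np.sqrt(T / p)
    f_vals = scale * (J @ K_mat)
    accepted = 0
    for i in range(p):
        J_new = rng.laplace()  # assumption: SGa(1, 1) is the standard Laplace law
        # Only the contribution of particle i to the regression function changes.
        f_new = f_vals + scale * (J_new - J[i]) * K_mat[i]
        log_ratio = log_lik(y, f_new, sigma) - log_lik(y, f_vals, sigma)
        if np.log(rng.uniform()) < log_ratio:
            J[i], f_vals = J_new, f_new
            accepted += 1
    return J, accepted / p
```

The returned acceptance rate is the quantity that the text suggests monitoring when choosing the number of particles.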

3.3. Assessing the convergence of the Markov Chain. The previous algorithm produces 
a Markov Chain whose invariant distribution is (an approximation of) the posterior 
distribution of a symmetric Gamma process mixture. However, if the Markov Chain is initialized 
in a region of low posterior probability mass, we may over-sample this region. To avoid 
such over-sampling, we discard the first $n_0$ samples of the chain using Geweke's convergence 
diagnostic (Geweke, 1992). 

More precisely, we monitor the convergence of the chain using the log-likelihood function. 
We start the algorithm with the Markov Chain initialized at random from the prior distribution. 
Then after $n \gg n_0$ iterations we compute Geweke's Z-statistic for the log-likelihood using 
the whole chain; if the statistic is outside the 95% confidence interval we continue to apply 
the diagnostic after discarding 10%, 20%, 30% and 40% of the chain. If the Z-statistic is still 
outside the 95% confidence interval, the chain is reported as having failed to converge, and we restart 
the algorithm from a different initialization point. 
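
The burn-in rule just described can be sketched as follows. This is a simplified illustration rather than the exact diagnostic: Geweke's statistic compares the mean of an early segment (here the first 10%) with that of a late segment (here the last 50%), and the original method estimates the variances from the spectral density at frequency zero. The sketch swaps in a batch-means variance estimate to stay self-contained, and it assumes the chain is long enough for each segment to hold at least ten batches. All function names are hypothetical.

```python
import numpy as np

def batch_means_var(x, n_batches=10):
    """Batch-means estimate of the variance of the sample mean of a correlated series."""
    usable = len(x) - len(x) % n_batches
    means = x[:usable].reshape(n_batches, -1).mean(axis=1)
    return means.var(ddof=1) / n_batches

def geweke_z(x, first=0.1, last=0.5):
    """Z-statistic comparing the mean of the first and last parts of a chain."""
    x = np.asarray(x, dtype=float)
    a = x[: int(first * len(x))]
    b = x[int((1.0 - last) * len(x)):]
    return (a.mean() - b.mean()) / np.sqrt(batch_means_var(a) + batch_means_var(b))

def burn_in_fraction(loglik, fractions=(0.0, 0.1, 0.2, 0.3, 0.4), z_crit=1.96):
    """Return the smallest discarded fraction passing the test, or None on failure."""
    loglik = np.asarray(loglik, dtype=float)
    for frac in fractions:
        kept = loglik[int(frac * len(loglik)):]
        if abs(geweke_z(kept)) < z_crit:
            return frac
    return None
```

A return value of `None` corresponds to the case where the chain is reported as having failed to converge and the sampler is restarted.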

Once we have discarded the first $n_0$ samples using Geweke's test, we run the chain 
long enough to get an Effective Sample Size (ESS) of at least 1000 samples, where we 
measure the ESS through the value of the log-likelihood at each iteration of the Markov 
Chain. Thinning the chain is not required in general; however, we found in practice that 
a slight thinning improves the efficiency of the sampling. 

In fig. 1, we draw some examples of the temporal evolution of the log-likelihood on a simple 
univariate Gaussian mean regression problem. Here and after, we always choose step sizes 
in RWMH steps to achieve approximately 30% acceptance rates for each class of updates. 
Each subfigure represents 10 simulations with a random starting point of the Markov Chain, 
distributed according to the prior distribution. We draw each subfigure varying the parameters 
liable to influence the mixing time of the chain, notably $m$ and the number of particles. We 
observe that the chain reaches equilibrium quickly, especially when the number 
of particles is high. This last remark has to be balanced against the time complexity of the 
algorithm, which is $O(mnp)$ for a naive implementation and, depending on the nature of the 
likelihood, can be reduced to $O(mp)$ or $O(mp^2)$. 

4. Examples of simulations 

We now turn our attention to simulated examples to illustrate the performance of mixture 
models. First, we use mixtures as a prior distribution on the regression function in the 
univariate mean regression problem with normal errors. Of course, mixtures become most 
interesting when the statistical model is more involved. Hence, we then present 
simulation results for the multivariate inverse problem of CT imaging. 

4.1. Mean regression with normal errors. We present results of our algorithm on several 
standard test functions from the wavelet regression literature (see Marron et al., 1998), 
following the methodology of Antoniadis et al. (2001) (i.e. Gaussian mean regression 
with fixed design and unknown variance). However, it should be noted that mixtures are 
not a new Bayesian implementation of wavelet regression, and are much more general (see 
for instance the next section). For each test function, the noise variance is chosen so that 
the root signal-to-noise ratio is equal to 3 (a high noise level), and each simulation run was 
repeated 100 times with all simulation parameters held constant, except the noise, which was 
regenerated. We ran the algorithm for location-scale mixtures of Gaussians and Symmlet8, 
with a normal N(0.5, 0.3) distribution as prior distribution on translations, and a mixture of 
Gamma distributions for scales (Ga(30, 0.06) and Ga(2, 0.04), with expectations 500 and 50 
respectively). In addition to the core algorithm of section 3.2, we also added 

• a Gibbs step estimation of the noise variance, with Inverse-Gamma prior distribution, 

• a Ga(2, 0.5) (with expectation 4) prior on $\alpha$, with sampling of $\alpha$ done through a Gibbs 
update according to the method proposed in West (1992), 

• a Dirichlet prior on the weights of the mixture of Ga(20, 0.2) and Ga(2, 0.1), with 
sampling of the mixture weights done through Gibbs sampling in a standard way, 

• a Ga(5, 10) (with expectation 0.5) prior on $T$, instead of the usual $\mathrm{Ga}(\alpha, \eta)$, which 
adds more flexibility. 

Figure 1. Time evolution of the log-likelihood for different starting points 
of the Markov Chain, chosen according to the prior distribution, and various 
parameters of the algorithm. The figures are taken from the test function blip 
of section 4.1. 

The choice of a mixture distribution as prior on scales may appear surprising, but we found 
in practice that using a bimodal distribution on scales substantially improves the performance of the 
algorithm, especially when few data are available and/or the noise is high, because in general 
both large-scale and small-scale components are needed to estimate the regression function. 

We ran the algorithm for n = 128 and n = 1024 data points, and the performance is measured 
by the average root mean square error, defined as the average of the square root of the mean 
squared error $n^{-1} \sum_{i=1}^n |\hat{f}(x_i) - f_0(x_i)|^2$, with $\hat{f}$ denoting the posterior mean and $f_0$ the 
true function. We ran on the same datasets the Translation-Invariant hard thresholding 
algorithm (TI-H) with Symmlet8 wavelets (see Antoniadis et al. (2001)), which is one of the 
best performing algorithms on this collection of test functions. We ran our algorithm with 
Symmlet8 kernels to make this comparison more relevant, since the choice of the kernel has a 
major impact on the performance of the algorithm (see section 4.1.4 below). 
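
For concreteness, the performance measure can be written as a few lines of code. This is a minimal sketch of the definition given above, with hypothetical function names; `estimates` is assumed to hold one posterior mean per simulation run, each evaluated at the design points.

```python
import numpy as np

def rmse(f_hat, f_true):
    """Square root of the mean squared error over the design points."""
    return float(np.sqrt(np.mean(np.abs(f_hat - f_true) ** 2)))

def average_rmse(estimates, f_true):
    """Average of the per-run RMSE over repeated simulation runs."""
    return float(np.mean([rmse(f_hat, f_true) for f_hat in estimates]))
```

Note that the square root is taken within each run before averaging, so the reported figure is an average of RMSEs and not the root of an averaged MSE.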

4.1.1. Alternatives. In Wolpert et al. (2011), the authors develop a reversible-jump MCMC scheme 
where the random measure is thresholded, i.e. small jumps are removed, yielding a compound 
Poisson process approximation of the random measure with almost surely a finite 
number of jumps, which allows numerical computations. We also ran their algorithm with a 
thresholding level of $\epsilon = 0.05$ (which seems to give the best performance), a Ga(15, 1) prior 
on $\eta$, and all other parameters being exactly the same as described in the previous section. 
We use the criteria of section 3.3 to decide when to stop the chain. 

4.1.2. Choosing the number of particles. It is not clear how to choose the number of particles 
in the algorithm. In theory, the more particles the better. In practice, however, we recommend 
choosing the number of particles according to the acceptance rate of the particle weight moves 
in step 3 of the algorithm. We found in practice that an acceptance rate between 20% and 
30% is acceptable, as illustrated in fig. 2. 

4.1.3. Simulation results. In tables 1 and 2 we summarize the results for location-scale mixtures 
of Gaussians and Symmlet8 produced by the algorithm of section 3.2 and by the RJMCMC 
algorithm of Wolpert et al. (2011), with the TI-H method as reference. We used p = 150 
particles for both the datasets with n = 128 covariates and n = 1024 covariates, which is a 
good compromise between performance and computational cost. Regarding our algorithm 
and the RJMCMC algorithm, no particular effort was made to determine the value of the 
fixed parameters. 

The Gibbs algorithm samples from the full posterior distribution, which permits 
estimation of posterior credible bands, as illustrated in figs. 3 and 4, where the credible 
bands were drawn by retaining the 95% of samples with the smallest $\ell_2$-distance to the 
posterior mean estimator. Although the algorithm samples an approximated version of the 
model, the accuracy of the credible bands is quite good, since the true regression 
function almost never falls outside the sampled 95% bands, as can be seen in the examples 
of figs. 3 and 4. Despite the efficiency of the algorithm, future work should be done to develop new 
sampling techniques for regression with mixture models, mainly to reduce the computational cost. 

Figure 2. Mean over 100 runs of RMSE versus acceptance rate in step 3 of 
the algorithm for some typical test functions. For each signal the number of 
covariates is set to 128 and the RSNR is equal to 3. 

                TI-H         Gibbs               RJMCMC
Function        Symm8     Gauss     Symm8     Gauss     Symm8
step            0.0589    0.0517    0.0551    0.0550    0.0565
wave            0.0319    0.0323    0.0306    0.0342    0.0370
blip            0.0307    0.0301    0.0316    0.0323    0.0373
blocks          0.0464    0.0343    0.0374    0.0383    0.0418
bumps           0.0285    0.0162    0.0229    0.0224    0.0345
heavisine       0.0257    0.0267    0.0264    0.0280    0.0289
doppler         0.0443    0.0506    0.0418    0.0526    0.0493
angles          0.0293    0.0266    0.0282    0.0274    0.0305
parabolas       0.0344    0.0301    0.0307    0.0312    0.0396
tshsine         0.0255    0.0285    0.0277    0.0291    0.0339
spikes          0.0237    0.0178    0.0207    0.0199    0.0218
corner          0.0177    0.0171    0.0170    0.0182    0.0255

Table 1. Summary of root mean squared errors of different algorithms for 
n = 128 covariates and a root signal-to-noise ratio of 3. 

4.1.4. Discussion. Obviously, the computational cost of our algorithm is high compared to TI-H, 
or any other classical wavelet thresholding method, even considering that it can intrinsically 
compute credible bands. But, as mentioned in Antoniadis et al. (2001), the choice of the kernel 
is crucial to the performance of estimators. The attractiveness of mixtures then comes from the fact that 
we are not restricted to location-scale or location-modulation kernels, and almost any function 
is acceptable as a kernel, which is not the case for most regression methods. Moreover, there 
are no requirements on how the data are spread, which makes the method interesting in inverse 
problems, such as in the next section. 

Figure 3. Example of simulation results using location-scale mixtures of 
Gaussians. The root signal-to-noise ratio is equal to 3 for a sample size of 1024 
design points. The true regression function is represented with dashes, the 
mean of the sampled posterior distribution in blue and sampled 95% credible 
bands in pink. 

Figure 4. Example of simulation results using location-scale mixtures of 
Symmlet8. The root signal-to-noise ratio is equal to 3 for a sample size of 1024 
design points. The true regression function is represented with dashes, the 
mean of the sampled posterior distribution in blue and sampled 95% credible 
bands in pink. 

                TI-H         Gibbs               RJMCMC
Function        Symm8     Gauss     Symm8     Gauss     Symm8
step            0.0276    0.0268    0.0289    0.0282    0.0300
wave            0.0088    0.0118    0.0108    0.0133    0.0117
blip            0.0148    0.0162    0.0172    0.0180    0.0183
blocks          0.0222    0.0230    0.0241    0.0247    0.0256
bumps           0.0122    0.0132    0.0182    0.0201    0.0232
heavisine       0.0154    0.0134    0.0139    0.0147    0.0147
doppler         0.0180    0.0207    0.0196    0.0261    0.0225
angles          0.0123    0.0120    0.0123    0.0125    0.0128
parabolas       0.0135    0.0124    --        --        0.0145
tshsine         0.0107    0.0109    0.0111    0.0131    0.0120
spikes          0.0110    0.0075    0.0095    0.0095    0.0103
corner          0.0077    0.0075    0.0081    0.0095    0.0085

Table 2. Summary of root mean squared errors of different algorithms for 
n = 1024 covariates and a root signal-to-noise ratio of 3. 


4.2. Multivariate inverse problem example. Many medical imaging modalities, such as 
X-ray computed tomography imaging (CT), can be described mathematically as collecting 
data in a Radon transform domain. The process of inverting the Radon transform to form 
an image can be unstable when the data collected contain noise, so that the inversion needs 
to be regularized in some way. Here we model the image of interest as a measurable function 
$f : \mathbb{R}^2 \to \mathbb{R}$, and we propose to use a location-scale mixture of Gaussians to regularize the 
inversion of the Radon transform. 

More precisely, the Radon transform $Rf : \mathbb{R}_+ \times [0, \pi] \to \mathbb{R}$ of $f$ is such that 

\[ Rf(r, \theta) = \int_{-\infty}^{\infty} f(r \cos\theta - t \sin\theta,\; r \sin\theta + t \cos\theta)\, dt. \]

Then we consider the following model. Let $n, m \ge 1$. Assuming that the image is supported on $[-1, 1]^2$, 
we let $r_1, \dots, r_n$ be equidistributed in $[-\sqrt{2}, \sqrt{2}]$ and $\theta_1, \dots, \theta_m$ equidistributed in $[0, \pi]$. Then, 

\[ Y_{n,m} \sim N(Rf(r_n, \theta_m), \sigma^2) \quad \forall n, m, \]

\[ f \sim \Pi, \]


where $\Pi$ is a symmetric Gamma process location-scale mixture with base measure $\alpha F_A \times F_\mu$ 
on $\mathcal{E} \times \mathbb{R}^2$, $\alpha > 0$, and scale parameter $\eta > 0$. In the sequel, we use a normal distribution 
with mean zero and covariance matrix $\mathrm{diag}(\tau, \tau)$ as distribution for $F_\mu$. Regarding $F_A$, the 
choice is more delicate; we choose a prior distribution over the set of shearlet-type matrices 
of the form 

\[ \begin{pmatrix} 1 & s \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a & 0 \\ 0 & \sqrt{a} \end{pmatrix}, \]

where we set a $N(1, \sigma_a^2)$ distribution over the coefficient $a$ and $N(0, \sigma_s^2)$ over the coefficient 
$s$. This type of prior distribution for $F_A$ is particularly convenient for capturing anisotropic 
features such as edges in images (Easley et al., 2009). 
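
The shearlet-type matrices above are easy to build explicitly. The sketch below is illustrative only: the function names are hypothetical, and redrawing $a$ until it is positive (so that $\sqrt{a}$ is defined) is an assumption of this example, since the text does not say how non-positive draws are handled.

```python
import numpy as np

def shearlet_matrix(a, s):
    """Product of a shear matrix and an anisotropic (parabolic) scaling matrix."""
    shear = np.array([[1.0, s], [0.0, 1.0]])
    scaling = np.array([[a, 0.0], [0.0, np.sqrt(a)]])
    return shear @ scaling

def sample_shearlet_matrix(rng, sigma_a, sigma_s):
    """Draw a ~ N(1, sigma_a^2) and s ~ N(0, sigma_s^2), then build the matrix."""
    a = rng.normal(1.0, sigma_a)
    while a <= 0.0:  # assumption: reject non-positive scales
        a = rng.normal(1.0, sigma_a)
    s = rng.normal(0.0, sigma_s)
    return shearlet_matrix(a, s)
```

For example, `shearlet_matrix(4.0, 0.5)` gives the matrix with rows (4, 1) and (0, 2), whose determinant is $a^{3/2} = 8$.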

We ran our algorithm for n = 256 and m = 128 (32768 observations, a small amount), using 
the Shepp and Logan phantom as original image (Shepp and Logan, 1974). The variance of 
the noise is $\sigma^2 = 0.1$, whereas the image takes values between 0 and 2. Both the original image 
and the reconstruction are shown in fig. 5. Finally, we should mention that the choice of 
the Gaussian kernel for the mixture is convenient since it allows the likelihood to be computed 
analytically. However, from a practical point of view, a full implementation of the algorithm with the 
intention of reconstructing CT images may benefit from using a different kernel. 
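
To illustrate why the Gaussian kernel is convenient, consider the simplest case of an isotropic Gaussian bump $f(x) = \exp(-|x - \mu|^2 / (2 s^2))$. Integrating along the line parametrized in the definition of $Rf$ gives the closed form $Rf(r, \theta) = s \sqrt{2\pi}\, \exp(-(r - \mu_1 \cos\theta - \mu_2 \sin\theta)^2 / (2 s^2))$. The sketch below checks this against a direct numerical integration. It covers only the isotropic case, not the general matrix-scaled kernels used in the text, and the function names are hypothetical.

```python
import numpy as np

def radon_numeric(f, r, theta, t_max=10.0, n_t=4001):
    """Riemann-sum approximation of the line integral defining Rf(r, theta)."""
    t = np.linspace(-t_max, t_max, n_t)
    x1 = r * np.cos(theta) - t * np.sin(theta)
    x2 = r * np.sin(theta) + t * np.cos(theta)
    return float(np.sum(f(x1, x2)) * (t[1] - t[0]))

def radon_gaussian(r, theta, mu, s):
    """Closed-form Radon transform of exp(-|x - mu|^2 / (2 s^2))."""
    proj = mu[0] * np.cos(theta) + mu[1] * np.sin(theta)
    return float(s * np.sqrt(2.0 * np.pi) * np.exp(-0.5 * (r - proj) ** 2 / s**2))
```

By linearity of the Radon transform, the transform of a finite mixture of such bumps is the corresponding weighted sum of closed-form terms.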


Figure 5. Simulation of X-ray computed tomography imaging using symmetric 
Gamma process location-scale mixture of Gaussians. On the left: the 
original image. On the right: the reconstructed image from 32768 observations 
of the Radon transform of the original image in a Gaussian noise. 




5. Rates of convergence 

In this section, we investigate posterior convergence rates in fixed design Gaussian regression 
for both symmetric Gamma location-scale mixtures and symmetric Gamma 
location-modulation mixtures. 

5.1. Notations. In the sequel we repeatedly use the following notations. 

• The conventional multi-index notation: for all $\alpha = (\alpha_1, \dots, \alpha_d) \in \mathbb{N}^d$ and all 
$z = (z_1, \dots, z_d) \in \mathbb{R}^d$ we write $|\alpha| := \alpha_1 + \dots + \alpha_d$, $\alpha! := \alpha_1! \cdots \alpha_d!$, and $z^\alpha := z_1^{\alpha_1} \cdots z_d^{\alpha_d}$. 
Moreover, for all $f : \mathbb{R}^d \to \mathbb{R}$ with continuous $k$-th order partial derivatives at $x \in \mathbb{R}^d$ 
we write 

\[ D^\alpha f(x) := \frac{\partial^{|\alpha|} f}{\partial z_1^{\alpha_1} \cdots \partial z_d^{\alpha_d}}(x), \qquad |\alpha| \le k. \]

• Let $\Omega$ be an open subset of $\mathbb{R}^d$ and $\overline{\Omega}$ be the closure of $\Omega$. For any $\beta > 0$, we define 
$C^\beta(\overline{\Omega})$, the Hölder space on $\Omega$, as the set of all functions on $\Omega$ such that $\|f\|_{C^\beta} := 
\max_{|\alpha| \le k} \sup_{x \in \Omega} |D^\alpha f(x)| + \max_{|\alpha| = k} \sup_{x \ne y \in \Omega} |D^\alpha f(x) - D^\alpha f(y)| / |x - y|^{\beta - k}$ is finite, 
where $k$ is the largest integer strictly smaller than $\beta$. 

• We denote by $|\cdot|_d$ the standard Euclidean norm on $\mathbb{R}^d$, and, for any $x, y \in \mathbb{R}^d$, $xy$ is 
the standard inner product. For any $d \times d$ matrix $A$ with real eigenvalues, we denote by 
$\lambda_1(A) \ge \dots \ge \lambda_d(A)$ its eigenvalues in decreasing order, by $\|A\| := \sup_{x \ne 0} |Ax|_d / |x|_d$ its 
spectral norm, and $\|A\|_{\max} := \max_{i,j} |A_{ij}|$, where $A_{ij}$ are the entries of $A$. 

• Given a signed measure $\mu$ on a measurable space we let $\mu^+$ and $\mu^-$ denote 
respectively the positive and negative parts of the Jordan decomposition of $\mu$. Also, 
$|\mu| = \mu^+ + \mu^-$ denotes the total variation measure of $\mu$. 

• Inequalities up to a generic constant are denoted by the symbols $\lesssim$ and $\gtrsim$. 

5.2. The model. We consider the problem of a random response $Y$ corresponding to a deterministic 
covariate vector $x$ taking values in $[-S, S]^d$ for some $S > 0$. We aim at estimating 
the regression function $f : [-S, S]^d \to \mathbb{R}$ such that $f(x_i) = \mathbb{E} Y_i$, based on independent 
observations of $Y$. More precisely, the nonparametric regression model we consider is the 
following, 

\[ Y_i \mid \epsilon_i = f(x_i) + \epsilon_i, \qquad i = 1, \dots, n, \]
\[ \epsilon_1, \dots, \epsilon_n \mid \sigma^2 \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2), \quad \text{independently of } (f, \sigma), \]
\[ (f, \sigma) \sim \Pi, \]

with $\Pi$ the distribution on an abstract space $\Theta$, given by $\sigma \sim P_\sigma$ independently of $f$ drawn 
from the distribution of a symmetric Gamma process mixture. 

5.3. A general result. Let $P_{\theta,i}$ denote the distribution of $Y_i$ under the parameter $\theta = (f, \sigma)$, 
$P_\theta^n$ denote the joint distribution of $(Y_1, \dots, Y_n)$, $P_\theta^\infty$ the distribution of the infinite 
sequence $(Y_1, Y_2, \dots)$, and $\|f\|_{2,n}^2 := n^{-1} \sum_{i=1}^n |f(x_i)|^2$. Let us define the distance $\rho_n(\theta_0, \theta_1) := 
\|f_1 - f_0\|_{2,n} + |\log \sigma_0 - \log \sigma_1|$. For the regression method based on $\Pi$, we say that its posterior 
convergence rate at $\theta_0$ in the metric $\rho_n$ is $\epsilon_n$ if there is $M < +\infty$ such that 

\[ \lim_{n \to \infty} \Pi\big(\{\theta \in \Theta : \rho_n(\theta, \theta_0) > M \epsilon_n\} \mid Y_1, \dots, Y_n\big) = 0 \quad P_{\theta_0}^\infty\text{-a.s.} \tag{4} \]
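
As a small illustration of the metric just defined, the function below computes $\rho_n$ from the values of two regression functions at the design points and the two noise levels. It is a direct transcription of the definition; the function name and argument layout are hypothetical.

```python
import numpy as np

def rho_n(f0_vals, f1_vals, sigma0, sigma1):
    """Empirical L2 distance between f0 and f1 plus |log sigma0 - log sigma1|."""
    empirical_l2 = np.sqrt(np.mean(np.abs(f1_vals - f0_vals) ** 2))
    return float(empirical_l2 + abs(np.log(sigma0) - np.log(sigma1)))
```

The logarithmic term makes the distance invariant to a common rescaling of the two noise levels.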

Most approaches to rates of convergence rely on ideas coming from density mixture 
models (Ghosal et al., 2000; Shen et al., 2013; Canale and De Blasi, 2013). Indeed, we prove 
that equation (4) holds by verifying a set of sufficient conditions established in theorem 2. 
For $\epsilon > 0$ and any subset $A$ of a metric space equipped with metric $\rho$, let $N(\epsilon, A, \rho)$ denote 
the $\epsilon$-covering number of $A$, i.e. the smallest number of balls of radius $\epsilon$ needed to cover 
$A$. Also, for all $i = 1, \dots, n$ define $K_i(\theta_0, \theta) := \int (\log dP_{\theta_0,i}/dP_{\theta,i})\, dP_{\theta_0,i}$ and $V_{2,i}(\theta_0, \theta) := 
\int (\log dP_{\theta_0,i}/dP_{\theta,i} - K_i(\theta_0, \theta))^2\, dP_{\theta_0,i}$, and let 

\[ K_n(\theta_0, \epsilon) := \Big\{ \theta : \frac{1}{n} \sum_{i=1}^n K_i(\theta_0, \theta) \le \epsilon^2, \; \frac{1}{n} \sum_{i=1}^n V_{2,i}(\theta_0, \theta) \le \epsilon^2 \Big\} \]

be the Kullback-Leibler ball of size $\epsilon$ around $\theta_0 := (f_0, \sigma_0)$. Theorem 2 is the analogue of 
theorem 5 in Ghosal and van der Vaart (2007b) for Gaussian mean regression with fixed 
design; the major difference resides in constructing suitable test functions, and extra care 
has to be taken regarding the fact that observations are not i.i.d. The proof of theorem 2 is 
given in section 7. 

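Theorem 2. Let $K := 3(32 \vee \ldots)$, and $\epsilon_n \to 0$ with $n \epsilon_n^2 \to \infty$. Suppose that $\Theta_n \subset \Theta$ 
is such that $\Pi(\Theta_n^c) \le e^{-3 n \epsilon_n^2}$ for $n$ large enough. Assume that $\Theta_n \subset \bigcup_j \Theta_{n,j}$ is such that for 
some $M > 0$, 

\[ \lim_n \sum_j \sqrt{N(M \epsilon_n, \Theta_{n,j}, \rho_n)}\, \sqrt{\Pi(\Theta_{n,j})}\, e^{-(K M^2 - 2) n \epsilon_n^2} = 0, \]
\[ \Pi(K_n(\theta_0, \epsilon_n)) \ge e^{-n \epsilon_n^2}. \]

Then $\Pi(\theta \in \Theta : \rho_n(\theta_0, \theta) > 11 M \epsilon_n \mid Y_1, \dots, Y_n) \to 0$ in $P_{\theta_0}^n$-probability. 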

5.4. Supplementary assumptions. In order to derive rates of convergence (and only for 
this purpose) we make supplementary assumptions on the choice of the mother function $g$ and of the 
base measure $\alpha F$. 

5.4.1. Location-scale mixtures. We restrict our discussion to priors for which the following 
conditions are verified. We assume that 

• $g : \mathbb{R}^d \to \mathbb{R}$ is a nonzero Schwartz function such that $|g(x)| \le \exp(-C_0|x|_d^{\tau})$ for some $C_0, \tau > 0$. We assume that there is $0 < \gamma < 1$ such that $\sup_{|\alpha|=k}|D^{\alpha}g(0)| \le \exp(\gamma k\log k)$ for all $k$ large enough; this last assumption is not obvious, but it is met for example with $\gamma = 1/2$ if $g$ is a multivariate Gaussian (see proposition 14 in appendix).

• $\alpha F := \alpha F_A \times F_\mu$, where $F_A$ is a probability measure on $\mathcal{E}$, the space of symmetric positive definite $d \times d$ real matrices, and $F_\mu$ a probability measure on $[-2S, 2S]^d$. We also assume that there exist positive constants $\kappa > 0$, $\kappa^* > d(d-1)$, $a_1,\dots,a_5$, $b_1,\dots,b_6$, $C_1,\dots,C_3$ such that for any $0 < s_1 \le \dots \le s_d$, any $z_0 \in [-2S,2S]^d$, all $t \in (0,1)$ and all $x > 0$ sufficiently large

$$F_\mu(z : |z - z_0|_d \le t) \ge b_1 t^{a_1},$$ (5)

$$F_A(A : \lambda_d(A^{-1}) \ge x) \le b_2\exp(-C_2 x^{a_2}),$$ (6)

$$F_A(A : \lambda_1(A^{-1}) < 1/x) \le b_3 x^{-a_3},$$ (7)

$$F_A(A : s_j < \lambda_j(A^{-1}) < s_j(1+t),\ 1 \le j \le d) \ge b_4 s_1^{a_4} t^{a_5}\exp(-C_3 s_d^{\kappa/2}),$$ (8)

$$F_A(A : \lambda_1(A)/\lambda_d(A) > x) \le b_6 x^{-\kappa^*}.$$ (9)

Equations (6) to (8) are classical and are met for instance with $\kappa = 2$ if $F_A$ is the inverse-Wishart distribution (Shen et al., 2013, lemma 1). For a thorough discussion about equation (9) we refer to Canale and De Blasi (2013) and references therein.

• $P_\sigma$ is a probability distribution on $(0, \infty)$. We also assume that there are positive constants $a_7, a_8, a_9, b_7, b_8, C_8$, and $b_9$ possibly depending on $\sigma_0 > 0$, such that for all $t \in (0,1)$

$$P_\sigma(\sigma : \sigma > x) \le b_7 x^{-a_7},$$ (10)

$$P_\sigma(\sigma : \sigma < 1/x) \le b_8\exp(-C_8 x^{a_8}),$$ (11)

$$P_\sigma(\sigma : \sigma_0 \le \sigma \le \sigma_0(1+t)) \ge b_9 t^{a_9}.$$ (12)

5.4.2. Location-modulation mixtures. We restrict our discussion to priors for which the following conditions are verified. We assume that

• $g : \mathbb{R}^d \to \mathbb{R}$ is a nonzero Schwartz function such that $g(x) > 0$ for all $x \in \mathbb{R}^d$ and $|g(x)| \le \exp(-C_0|x|_d^{\tau})$ for some $C_0 > 0$ and $\tau > 1$. We assume that there is a set $E \subset [-\pi,\pi]^d$ with strictly positive Lebesgue measure and a constant $C > 0$ such that $g(x) > C$ on $E$. We also assume that there is $0 < \gamma < 1$ such that $\sup_{|\alpha|=k}|D^{\alpha}g(0)| \le \exp(\gamma k\log k)$ for all $k$ large enough. As in the previous section, these assumptions are met for the multivariate Gaussian with $E = [-\pi,\pi]^d$, $\gamma = 1/2$ and $\tau = 2$ (see proposition 14 in appendix).

• $\alpha F := \alpha F_\xi \times F_\mu \times F_\phi$, where $F_\xi$ is a probability measure on $\mathbb{R}^d$, $F_\mu$ a probability measure on $[-2S,2S]^d$, and $F_\phi$ a probability measure on $[0,\pi/2]$. For all $t \in (0,1)$ and all $z_0 \in [-2S,2S]^d$ we assume that $F_\mu$ satisfies equation (5). We assume that there are positive constants $a_{10}, b_{10}$ such that for all $t \in (0,1)$ and all $\phi_0 \in [0,\pi/2]$, $F_\phi(\phi : |\phi - \phi_0| \le t) \ge b_{10}t^{a_{10}}$. We also assume that there exist positive constants $\eta > (d-1)/2$, $a_{12}, a_{13}, b_{11}, b_{12}$ such that for all $t \in (0,1)$, all $\xi_0 \in \mathbb{R}^d$ and for all $x > 0$

$$F_\xi(\xi : |\xi|_d \ge x) \le b_{11}(1+x)^{-2(\eta+1)},$$ (13)

F& : \Z-Z 0 \ d <t)>b 12 m-^. (14) 

• $P_\sigma$ satisfies the assumptions of equations (10) to (12).

5.5. Results. Theorem 2 serves as a starting point for proving rates of contraction for symmetric Gamma process location-scale and location-modulation mixtures in the model of section 5.2. The proofs of the next theorems resemble those of de Jonge and van Zanten (2010), but they consider only a location mixture with locations taken on a lattice, allowing for a very specific construction of the sets $\Theta_n$. Here, we do not assume that locations are spread over a lattice, which makes the construction of $\Theta_n$ more involved. Our construction is inspired by Shen et al. (2013) for Dirichlet process mixtures, but adapted to symmetric Gamma processes (indeed, the same construction should work for many Lévy processes). Also, theorem 2 allows for partitioning $\Theta_n$ into slices $\Theta_{n,j}$, a step which is unnecessary for location mixtures (de Jonge and van Zanten, 2010; Shen et al., 2013), but yields better rates and weaker assumptions on the prior when dealing with location-scale (Canale and De Blasi, 2013) and location-modulation mixtures.

Regarding the model of section 5.2, with deterministic covariates $x_1,\dots,x_n$ arbitrarily spread in $[-S,S]^d$, we have the following theorem for location-scale mixtures. We notice that, unlike de Jonge and van Zanten (2010), we do not assume that the covariates are spread on a strictly smaller set than $[-S,S]^d$, i.e. the support of the covariates and the domain of the regression function are the same.

Theorem 3. Let $\zeta = 1 \vee 2/(\tau - \gamma\tau)$. Suppose that $f_0 \in C^{\beta}[-S,S]^d$ for some $S > 0$. Under the assumptions of section 5.4, equation (4) holds for the location-scale prior with

$$\epsilon_n^2 = n^{-2\beta/(2\beta+d+\kappa/2)}(\log n)^{2\beta d(\zeta-1)/(2\beta+d+\kappa/2)}.$$

Theorem 3 gives a rate of contraction analogous to the rates found in Canale and De Blasi (2013), that is to say, suboptimal with respect to the frequentist minimax rate of convergence. Indeed, if one uses an inverse-Wishart distribution for $F_A$, then $\kappa = 2$; we can achieve $\kappa = 1$ with a distribution supported on diagonal matrices which assigns squares of inverse gamma random variables to the nonzero elements of the matrix. Obviously, the choice of $F_A$ matters since it has a direct influence on the rates of contraction of the posterior. Also notice that the rates depend on $\kappa/2$, which is slightly better than the $\kappa$ dependency found in Canale and De Blasi (2013). The reason is relatively artificial: it follows from the fact that we put a prior on the dilation matrices of the mixture, whereas they put a prior on the squares of the dilation matrices (covariance matrices).
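
To make the gap with the minimax rate concrete, the following minimal sketch compares the polynomial exponent of $\epsilon_n^2$ in theorem 3 with the usual minimax exponent $2\beta/(2\beta+d)$, ignoring all logarithmic factors. The function names are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction


def location_scale_exponent(beta, d, kappa):
    """Exponent r such that eps_n^2 ~ n^(-r) in theorem 3 (log factors ignored)."""
    beta, kappa = Fraction(beta), Fraction(kappa)
    return 2 * beta / (2 * beta + d + kappa / 2)


def minimax_exponent(beta, d):
    """Exponent of the minimax squared rate n^(-2*beta/(2*beta+d))."""
    beta = Fraction(beta)
    return 2 * beta / (2 * beta + d)


if __name__ == "__main__":
    beta, d = 2, 1
    for kappa in (2, 1):  # inverse-Wishart (kappa = 2) versus diagonal prior (kappa = 1)
        print("kappa =", kappa, "exponent =", location_scale_exponent(beta, d, kappa))
    print("minimax exponent =", minimax_exponent(beta, d))
```

For any fixed $\beta$ and $d$ the location-scale exponent stays strictly below the minimax one, and it increases as $\kappa$ decreases, which is the sense in which the choice of $F_A$ matters.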

Location-modulation mixtures were never considered before, because they are not satisfactory for estimating a density. In comparison with location-scale mixtures, the major difference in proving contraction rates lies in approximating the true regression function sufficiently well. We use a new approximation scheme, based on standard Fourier series analysis, yielding the following theorem.

Theorem 4. Suppose that $f_0 \in C^{\beta}[-S,S]^d$ for some $S > 0$. Under the assumptions of section 5.4, equation (4) holds for the location-modulation prior with

$$\epsilon_n^2 = n^{-2\beta/(2\beta+d)}(\log n)^{2\beta(2d+1)/(2\beta+d)}.$$

Although it was not surprising that location-scale mixtures yield suboptimal rates of convergence, we would have expected location-modulation mixtures to be suboptimal too, which is not the case (up to a power of $\log n$ factor). Moreover, location-modulation mixtures seem less stiff than location mixtures (Shen et al., 2013), hence they might be interesting to consider in regression.

Finally, it should be mentioned that all the rates here are adaptive with respect to $\beta > 0$; that is, location-scale and location-modulation mixtures achieve these rates simultaneously for all $\beta > 0$.

6. Proofs of section 3

6.1. Preliminaries on convergence of signed random measures. It is well known for random (non-negative) measures that it is enough to show weak convergence of finite dimensional distributions on a semiring of bounded sets generating $\mathcal{A}$ to prove vague convergence of the distribution, see for instance Kallenberg (1983, Theorem 4.2) or Daley and Vere-Jones (2007, Theorem 11.1.VII). This fact remains true for random signed measures, but not in an obvious way. Indeed, it is well known that the vague topology is not metrizable on $\mathcal{M}(\mathsf{X})$, even if $\mathsf{X}$ is Polish (for example, see Remark 1.2 in Del Barrio et al. (2007)), making the vague topology awkward to handle on $\mathcal{M}(\mathsf{X})$. In particular, it is not as direct as in the case of non-negative measures to prove that the $\sigma$-algebra generated by the sets $\{\{\mu \in \mathcal{M}(\mathsf{X}) : \mu(B) \in A\} : A \in \mathscr{B}(\mathbb{R}), B \in \mathcal{R}\}$, where $\mathcal{R}$ is a ring of bounded sets generating $\mathcal{A}$, coincides with the Borel $\sigma$-algebra of $\mathcal{M}(\mathsf{X})$, given the topology of vague convergence. However, once this last fact is proved, everything in the proof of Kallenberg (1983, Theorem 4.2) remains valid for signed random measures.

Surprisingly, there is little literature on vague convergence of signed random measures, and to our knowledge, the only reference available on this subject is Jacob and Oliveira (1995). We state here the result of interest for us, with only a sketch of the proof, as the details can be found in the original article.

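Lemma 1. Let $\mathcal{R} \subset \mathcal{A}$ denote the ring of bounded Borel sets of $\mathsf{X}$. Then the Borel $\sigma$-algebra of $\mathcal{M}(\mathsf{X})$ (given the weak-* topology) coincides with the $\sigma$-algebra generated by the sets $\{\{\mu \in \mathcal{M} : \mu(B) \in A\} : A \in \mathscr{B}(\mathbb{R}), B \in \mathcal{R}\}$ and also $\{\{\mu \in \mathcal{M} : \mu(f) \in A\} : A \in \mathscr{B}(\mathbb{R}), f \in C_c(\mathsf{X})\}$.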

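Sketch of proof. First, we shall prove that $\mathcal{S} := \sigma\{\{\mu \in \mathcal{M} : \mu(B) \in A\} : A \in \mathscr{B}(\mathbb{R}), B \in \mathcal{R}\} = \sigma\{\{\mu \in \mathcal{M} : \mu(f) \in A\} : A \in \mathscr{B}(\mathbb{R}), f \in C_c(\mathsf{X})\}$. Using the Hahn-Jordan decomposition of signed measures, this is a straightforward adaptation of Kallenberg (1983, Lemma 1.4).

Also, the argument of Kallenberg (1983, Lemma 4.1) for proving $\mathcal{S} \subset \mathscr{B}(\mathcal{M})$ remains valid here, but the converse inclusion is not as direct. Let $\mathcal{M}_+ \subset \mathcal{M}$ denote the cone of non-negative measures, and endow $\mathcal{M}_+$ with the topology $\mathcal{T}_+$ of vague convergence (i.e. $\mu_n$ converges to $\mu$ if $\mu_n(f) \to \mu(f)$ for any $f \in C_c(\mathsf{X})$) and the corresponding Borel $\sigma$-algebra $\mathscr{B}(\mathcal{M}_+)$. We denote by $\mathcal{S}_+$ the trace of $\mathcal{S}$ over $\mathcal{M}_+$. Hence, it suffices to prove that

(1) $\mathcal{S}_+ = \mathscr{B}(\mathcal{M}_+)$,

(2) $P : (\mathcal{M}, \mathcal{S}) \to (\mathcal{M}_+ \times \mathcal{M}_+, \mathcal{S}_+ \times \mathcal{S}_+)$, such that $P(\mu) := (\mu^+, \mu^-)$, is measurable,

(3) $R : (\mathcal{M}_+ \times \mathcal{M}_+, \mathscr{B}(\mathcal{M}_+) \times \mathscr{B}(\mathcal{M}_+)) \to (\mathcal{M}, \mathscr{B}(\mathcal{M}))$, such that $R(\mu, \nu) := \mu - \nu$, is measurable.

These 3 conditions imply that $R \circ P : (\mathcal{M}, \mathcal{S}) \to (\mathcal{M}, \mathscr{B}(\mathcal{M}))$ is $\mathcal{S}/\mathscr{B}(\mathcal{M})$-measurable, and since $R \circ P$ is just the identity mapping, this implies $\mathscr{B}(\mathcal{M}) \subset \mathcal{S}$, as required. □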


6.2. Proofs. 
Proof of theorem 1. In the whole proof, we use the Pochhammer symbols $x^{(n)}$ and $(x)_n$ for respectively the $n$th power of the increasing factorial of $x$, and the $n$th power of the decreasing factorial of $x$. Once the subtlety discussed in section 6.1 has been taken care of, the rest of the proof is identical to the proof of Proposition A.1 in Favaro et al. (2012), which we reproduce here for the sake of completeness. According to section 6.1 it is enough to check that

$$(Q_p(A_1),\dots,Q_p(A_k)) \xrightarrow{d} (Q(A_1),\dots,Q(A_k)),$$ (15)

for any collection of disjoint bounded measurable sets $A_1,\dots,A_k \in \mathcal{A}$, where $Q$ is a symmetric Gamma random measure with parameters $\alpha F(\cdot), \eta$. Obviously, for any vector $(v_1,\dots,v_k) \in \mathbb{R}^k$ the random variable $v_1Q(A_1) + \dots + v_kQ(A_k)$ has symmetric Gamma distribution, and hence is determined by its moments (because of proposition 10); by Billingsley (2008, Theorem 30.2), equation (15) holds if

$$E[Q_p(A_1)^{r_1}\cdots Q_p(A_k)^{r_k}] \to E[Q(A_1)^{r_1}\cdots Q(A_k)^{r_k}]$$ (16)

holds for any disjoint bounded measurable sets $A_1,\dots,A_k \in \mathcal{A}$ and any positive integers $r_1,\dots,r_k$. From now on, for any collection of measurable sets $A_1,\dots,A_k \in \mathcal{A}$, we set $A^c := \mathsf{X} \setminus \bigcup_{i=1}^k A_i$. We recall that if $\{X_i \in \mathsf{X} : 1 \le i \le p\}$ is a Pólya urn sequence with base distribution $\alpha F(\cdot)$, and $A_1,\dots,A_k \in \mathcal{A}$ are disjoint, then

$$P(\#\{i : X_i \in A_1\} = j_1,\dots,\#\{i : X_i \in A_k\} = j_k) = \frac{p!}{j_1!\cdots j_k!\,(p - \sum_{i=1}^k j_i)!}\; \frac{(\alpha F(A_1))^{(j_1)}\cdots(\alpha F(A_k))^{(j_k)}\,(\alpha F(A^c))^{(p-\sum_{i=1}^k j_i)}}{\alpha^{(p)}},$$

where $(j_1,\dots,j_k) \in \mathcal{E}_{k,p}$, with $\mathcal{E}_{k,p} := \{(j_1,\dots,j_k) \in \{0,\dots,p\}^k : \sum_{i=1}^k j_i \le p\}$. It is straightforward to show that both the lhs and the rhs of equation (16) are null whenever one of the $r_i$'s is odd. Therefore we shall only consider equation (16) for even exponents. We deduce from proposition 10 that for any disjoint bounded measurable sets $A_1,\dots,A_k \in \mathcal{A}$ and any positive integers $r_1,\dots,r_k$,

E {QMP- ■ ■ ■ QMX“] = (n 

p \ (aF(Ai))^i) ... (aF(A k ))(A)(aF(A c ))(P-^i=iii) 

■■■jk) 

x(h) {ri) -.-(3k) {rk) - 

Letting $s(\cdot,\cdot)$ and $S(\cdot,\cdot)$ denote the Stirling numbers of the first and second kind, we can mimic Favaro et al. (2012, Appendix A.1) to find that




E [QplAtf- ... QMX-] = a (r ' + - +r * ) (f[ 








ri mi r k m k 

x l s ( r i,mi)| ^2 S{rrii,si) ■ ■ ■ ^2 \ s ( r k,m k )\ S(m k ,s k ) 

777.1 —0 si=0 rrik=0 Sfc=0 

.. (aF(A 1 ))(«)...(aF(A*))“, , 

X --(?)•! + -+» 

Therefore, we conclude that, 


lim E [Q p (Ai) 

p—> OO L 


2ri 


•••«p(A) 2r *]=n 


( 2 r f )I (aF(A)) lr ‘>' 


2 — 1 


r ? ;! 


(V»7) 


2r; 


□ 
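
The Pólya urn probability recalled in the proof of theorem 1 is a Dirichlet-multinomial law, and it can be checked numerically with exact rational arithmetic. The sketch below is only an illustration under the assumption that $F$ is a probability measure, so that the masses $\alpha F(A_1),\dots,\alpha F(A_k),\alpha F(A^c)$ sum to $\alpha$; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def rising(x, n):
    """Rising factorial x^(n) = x (x + 1) ... (x + n - 1), with x^(0) = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= x + i
    return out


def polya_counts_pmf(counts, masses, p):
    """P(#{i : X_i in A_1} = j_1, ..., #{i : X_i in A_k} = j_k) after p draws.

    masses = (alpha F(A_1), ..., alpha F(A_k), alpha F(A^c)); their sum is alpha.
    """
    rest = p - sum(counts)
    if rest < 0:
        return Fraction(0)
    all_counts = list(counts) + [rest]
    alpha = sum(masses)
    coef = factorial(p)
    for j in all_counts:
        coef //= factorial(j)  # stays an integer: multinomial coefficient
    prob = Fraction(coef)
    for j, mass in zip(all_counts, masses):
        prob *= rising(mass, j)
    return prob / rising(alpha, p)


if __name__ == "__main__":
    masses = (Fraction(1, 2), Fraction(1, 3), Fraction(7, 6))  # alpha = 2, k = 2
    p = 4
    total = sum(polya_counts_pmf(c, masses, p) for c in product(range(p + 1), repeat=2))
    print("total probability:", total)
```

Summing over all admissible count vectors returns exactly one, and a single draw falls in $A_1$ with probability $F(A_1)$, as expected from the urn scheme.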


Proof of proposition 1. We can assume that all the $Q_p$ and $Q$ are defined on the same probability space $(\Omega, \mathcal{F}, P)$. The proof is an adaptation of Favaro et al. (2012, Theorem 2). We just have to take care that here we proved $Q_p \to Q$ vaguely in theorem 1, which does not necessarily imply that $f^{(p)}(x) \to f(x)$ pointwise. But by assumption, $x \mapsto K(x; y)$ is continuous and vanishes outside a compact set, and it is easily seen that the sequence of total masses $|Q_p|(\cdot\,;\omega)$ is almost surely bounded; then by Bauer (2001, Theorem 30.6) (which remains valid for signed measures), we have $f^{(p)}(x) \to f(x)$ pointwise, almost surely. The end of the proof is identical to Favaro et al. (2012, Theorem 2) for convergence in $L^1$, and the extension to $L^q$ with $1 < q < +\infty$ is straightforward. □


7. Proofs of section 5.3 

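Proof of theorem 2. The proof is similar to Ghosal and van der Vaart (2007b, theorem 5). The event $A_n$ that $\int \prod_{i=1}^n \frac{dP_{\theta,i}}{dP_{\theta_0,i}}(Y_i)\, d\Pi(\theta) \ge e^{-2n\epsilon_n^2}$ satisfies $P_{\theta_0}^n(A_n^c) \to 0$ by Lemma 10 in Ghosal and van der Vaart (2007a) and assumptions on $\Pi$. Therefore,

$$\begin{aligned}
P_{\theta_0}^n \Pi(\Theta_n^c \mid Y_1,\dots,Y_n) &\le P_{\theta_0}^n\big[\Pi(\Theta_n^c \mid Y_1,\dots,Y_n)\, 1_{A_n}\big] + P_{\theta_0}^n(A_n^c) \\
&\le e^{2n\epsilon_n^2}\, P_{\theta_0}^n \int_{\Theta_n^c} \prod_{i=1}^n \frac{dP_{\theta,i}}{dP_{\theta_0,i}}(Y_i)\, d\Pi(\theta) + P_{\theta_0}^n(A_n^c) \\
&\le e^{2n\epsilon_n^2}\, \Pi(\Theta_n^c) + P_{\theta_0}^n(A_n^c),
\end{aligned}$$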


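where the last line follows by Fubini's theorem. For $0 < \alpha_j < 1$, and $n$ large enough, lemma 2 states the existence of test functions $\psi_{n,j}$ such that

$$P_{\theta_0}^n \psi_{n,j} \le 2\alpha_j N(M\epsilon_n, \Theta_{n,j}, \rho_n)\, e^{-KM^2n\epsilon_n^2}, \qquad P_{\theta}^n(1 - \psi_{n,j}) \le \alpha_j^{-1} e^{-KM^2n\epsilon_n^2},$$

for all $\theta \in \Theta_{n,j}$ with $\rho_n(\theta, \theta_0) > 11M\epsilon_n$. Letting $U_n := \{\theta \in \Theta : \rho_n(\theta_0,\theta) > 11M\epsilon_n\}$,

$$\begin{aligned}
P_{\theta_0}^n\big[\Pi(U_n \cap \Theta_{n,j} \mid Y_1,\dots,Y_n)\, 1_{A_n}\big] &\le P_{\theta_0}^n \psi_{n,j} + P_{\theta_0}^n\big[\Pi(U_n \cap \Theta_{n,j} \mid Y_1,\dots,Y_n)(1 - \psi_{n,j})\, 1_{A_n}\big] \\
&\le P_{\theta_0}^n \psi_{n,j} + \sup_{\theta \in U_n \cap \Theta_{n,j}} P_{\theta}^n(1 - \psi_{n,j})\, \Pi(\Theta_{n,j})\, e^{2n\epsilon_n^2} \\
&\le 2\alpha_j N(M\epsilon_n, \Theta_{n,j}, \rho_n)\, e^{-KM^2n\epsilon_n^2} + \alpha_j^{-1}\Pi(\Theta_{n,j})\, e^{-(KM^2-2)n\epsilon_n^2},
\end{aligned}$$

where we used Fubini's theorem again. Put $\alpha_j = \sqrt{\Pi(\Theta_{n,j})/N(M\epsilon_n, \Theta_{n,j}, \rho_n)}$ (notice that $\alpha_j < 1$) and sum over $j$ to obtain the result in view of the last equation. □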


7.1. Existence of tests. Here we construct the test functions required in the proof of theorem 2. We proceed in two steps. First, we construct tests for the hypothesis $\theta = \theta_0$ against the alternative that $\theta$ belongs to a ball of radius $\epsilon/12$ centered at $\theta_1$ with $\rho_n(\theta_0,\theta_1) > \epsilon$; then in lemma 2 we construct the tests used in the proof of theorem 2.

Let 9 0 = (/o,0"o), #1 = (/i,di), 0io = (/i,0- o ), s = \A 2 + (108/ra) log(l/a), and define, 

,2 1 


A n := < y E M n : ^ log 


dPe 0 ,i , x ne 


2=1 


dP 


0io, i 

n 


(Vi) < -FFL 2 +21og«P 


96 ( Jq 


B^:= !/£*": n(l - d/3) < £ 


2=1 


Vi ~ /o(^ 
O "0 


< n(l + d/3) >. 


Then we construct the sequence (4> n )n> o as 


</> n (yi,...,y n ) 

: = i A „(yi, ...,Y n ) + t Bn {Yu ... ,y„) - ujyi,... ,y„) wn,..., y„). 

Proposition 3. Let I\ = 3(32V4cjg) 1 . The tests <p n defined above satisfy Pg 0 4>n < e ^ ne2 / 144 

and sup 6 , ge:pn ( 0ei ) <e / 12 P e n (l - fi n ) < Q~ Kn( ? f or a ll 0i € 0 with p n (0 o ,0i) > e and all 
0 < e < 1. 


Proof. Type I error of $\phi_n$. It is clear that $P_{\theta_0}^n\phi_n \le P_{\theta_0}^n(A_n) + P_{\theta_0}^n(B_n)$. Moreover, by proposition 4 in Birgé (2006), we have $P_{\theta_0}^n(A_n) \le \alpha e^{-n\epsilon^2/(192\sigma_0^2)}$, and following the proof of lemma 7 in Choi and Schervish (2007), the bound $P_{\theta_0}^n(B_n) \le 2e^{-n\delta^2/108} = 2\alpha e^{-n\epsilon^2/108}$ holds for $n$ sufficiently large.

Type II error of $\phi_n$. Let $\theta = (f,\sigma)$ be such that $\rho_n(\theta,\theta_1) \le \epsilon/12$. Clearly, $P_{\theta}^n(1 - \phi_n) = P_{\theta}^n(1 - 1_{A_n})(1 - 1_{B_n}) \le P_{\theta}^n(A_n^c) \wedge P_{\theta}^n(B_n^c)$. We should consider two situations: either $|\log\sigma_0 - \log\sigma_1| \le \epsilon/2$, or $|\log\sigma_0 - \log\sigma_1| > \epsilon/2$.

• If $|\log\sigma_0 - \log\sigma_1| \le \epsilon/2$, then $\rho_n(\theta_0,\theta_1) > \epsilon$ implies $\|f_0 - f_1\|_{2,n} > \epsilon/2$, and for all $\theta$ with $\rho_n(\theta,\theta_1) \le \epsilon/12$, it is clear that $\|f - f_1\|_{2,n} \le \epsilon/12$. It follows from proposition 4 in Birgé (2006) that


mK) < ^P 


n||/o - fi |||,n - ne2 / 8 + 24o'g log a 


24a 0 2 


< — exp 
a 


ne 

64a 2 
• If $|\log\sigma_0 - \log\sigma_1| > \epsilon/2$, then $\rho_n(\theta,\theta_1) \le \epsilon/12$ implies $|\log\sigma - \log\sigma_0| > 5\epsilon/12$. We should again subdivide this case, considering either $\sigma/\sigma_0 > 1$ or not. For both cases we mimic and adapt the proof of lemma 7 in Choi and Schervish (2007).

- If $\sigma/\sigma_0 > 1$, because $|\log\sigma - \log\sigma_0| > \epsilon/3$ we have $\sigma > \sigma_0 e^{\epsilon/3}$, and thus $\sigma > (1 + \epsilon/3)\sigma_0$ for any $\epsilon > 0$. Let $W \sim \chi_n^2$ and let $W'$ have a noncentral $\chi^2$ distribution with $n$ degrees of freedom and noncentrality parameter $\sum_{i=1}^n (f(x_i) - f_0(x_i))^2/\sigma^2$. Then,

Po(K) <Pe(Yl 


\ i —1 


Yi - f 0 (xi) 


vo 


< n 


= Pr ( W' < n°^ ( 1 + - ) ) < Pr 


- (j2 



W < 




But whenever 0 < a < 2, we have 




(T z 


1 + A < 


1 + 5/3 


< 


+ 


(108/n) log(l/a) 


(l + e/3 ) 2 - l + e/3 6 e(l + e/3 ) 2 


Therefore, by Markov’s inequality we get for all t < 1/2 


6 e(l + e/3 ) 2 

Choosing t = — e/18 leads to 
log(l/a) 


Pe( B n ) < exp 


1 


< — exp < — 


a 


(1 + e / 3) 2 

7 ne 2 
"648" 


exp 


l + e/3 
e/9 


< — exp 
a 


l + e/3 
2 


- log(l + e/9) 


ne 

"93" 


because we have $0 < \epsilon < 1$. This concludes the proof when $\sigma/\sigma_0 > 1$.

- In the other direction, $\sigma/\sigma_0 < 1$ and $|\log\sigma - \log\sigma_0| > 5\epsilon/12$ imply that $\sigma < (1 - \epsilon/3)\sigma_0$ for any $0 < \epsilon < 1$. Using the same strategy as in the previous item it is possible to show that the bound $P_{\theta}^n(B_n^c) \le (1/\alpha)e^{-n\epsilon^2/1536}$ holds. □

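Lemma 2. Let $\Theta_n \subset \Theta$ and $K := 3(32 \vee 4\sigma_0^2)^{-1}$. Then for any $0 < \alpha < 1$ there exists a collection of test functions $(\psi_n)_{n\ge1}$ such that for any $0 < \epsilon < 1/12$ and any $n \ge 1$

$$P_{\theta_0}^n\psi_n \le 2\alpha N(\epsilon, \Theta_n, \rho_n)\, e^{-Kn\epsilon^2}, \qquad \sup_{\theta\in\Theta_n :\, \rho_n(\theta,\theta_0) > 11\epsilon} P_{\theta}^n(1 - \psi_n) \le \alpha^{-1} e^{-Kn\epsilon^2}.$$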

Proof. Let $N = N(\epsilon/12, \Theta_n, \rho_n)$ denote the number of balls of radius $\epsilon/12$ needed to cover $\Theta_n$. Let $(B_1,\dots,B_N)$ denote the corresponding covering and $(\zeta_1,\dots,\zeta_N)$ denote the centers of $(B_1,\dots,B_N)$. Now let $J$ be the index set of balls $B_j$ with $\rho_n(\theta_0,\zeta_j) > \epsilon$. Using proposition 3 for $0 < \epsilon < 1$ and for any ball $B_j$ with $j \in J$, we can build a test function $\phi_{n,j}$ satisfying

$$P_{\theta_0}^n\phi_{n,j} \le 2\alpha e^{-Kn\epsilon^2/144}, \qquad \sup_{\theta\in B_j} P_{\theta}^n(1 - \phi_{n,j}) \le \alpha^{-1}e^{-Kn\epsilon^2/144}.$$

Let $\psi_n := \max_{j\in J}\phi_{n,j}$. Then $P_{\theta_0}^n\psi_n \le \sum_{j\in J} P_{\theta_0}^n\phi_{n,j} \le 2\alpha N(\epsilon/12, \Theta_n, \rho_n)\, e^{-Kn\epsilon^2/144}$ and also $P_{\theta}^n(1 - \psi_n) \le \min_{j\in J}\sup_{\theta'\in B_j} P_{\theta'}^n(1 - \phi_{n,j}) \le \alpha^{-1}e^{-Kn\epsilon^2/144}$ for any $\theta \in \Theta_n$ with $\rho_n(\theta,\theta_0) > 11\epsilon/12$. □


8. Proof of theorem 3 

We prove theorem 3 by verifying the set of sufficient conditions established in theorem 2. 


8.1. Sieve construction. For constants H,M >0 to be determined later, we define the sets 


@r 


V n :={Ae£ s : n" 1 ^ < \{A) <n~ 1 ^(l +Me n /n) n \ t = 

n - 2 /a s < a 2 < n - 2 / a s(i_| -Me n ) n , f(x) = f Ka(x — p)Q(dAdp), 

< (f . Q = J2iZi u idAi,m, suppQ = £ s x [~2S,2S] d , |«;| < n, 

#{i : |«j| > n _1 , Ai E £>„} < iLne^/log rc, 

Kl i v n} < Me n , X“l M 1 (I U *I < n ~ 1 } < Me n 


In the sequel, we assume without loss of generality that the jumps of $Q$ in the definition of $\Theta_n$ are ordered so that there is no jump with $|u_i| > n^{-1}$ and $A_i \in \mathcal{D}_n$ when $i > Hn\epsilon_n^2/\log n$. Moreover, we consider the following partition of $\Theta_n$. Let $H_n$ be the largest integer smaller than $Hn\epsilon_n^2/\log n$. Then for any $j = (j_1,\dots,j_{H_n}) \in \{1,2,\dots\}^{H_n}$, inspired by Canale and De Blasi (2013, theorem 2) we define the slices

$$\Theta_{n,j} := \{(f,\sigma) \in \Theta_n : n^{2^{j_i-1}} < \lambda_1(A_i)/\lambda_d(A_i) \le n^{2^{j_i}}\ \ \forall i \le H_n\}.$$


Lemma 3. Assume that there is $0 < \gamma_1 < 1$ such that $\epsilon_n^2 \ge n^{-\gamma_1}$ for all $n$ large enough. Then for $H = 6(1 - \gamma_1)^{-1}$ it holds $\Pi(\Theta_n^c) \le \exp(-3n\epsilon_n^2)$ as $n \to \infty$.


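Proof. From the definition of $\Theta_n$, it is clear that

$$\begin{aligned}
\Pi(\Theta_n^c) \le{} & \Pi\big(\#\{i : |u_i| > n^{-1}\} > Hn\epsilon_n^2/\log n\big) + \Pi\Big(\sum_{i\ge1}|u_i| > n\Big) \\
& + \Pi\Big(\sum_{i\ge1}|u_i|\, 1\{|u_i| \le n^{-1}\} > M\epsilon_n\Big) + \Pi\Big(\sum_{i\ge1}|u_i|\, 1\{A_i \notin \mathcal{D}_n\} > M\epsilon_n\Big) \\
& + P_\sigma(\sigma^2 < n^{-2/a_8}) + P_\sigma(\sigma^2 > n^{-2/a_8}(1 + M\epsilon_n)^n).
\end{aligned}$$ (17)

The bounds on the two last terms are obvious in view of equations (10) and (11).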

By the superposition theorem (Kingman, 1992, section 2), for any measurable set $A \subset \mathcal{E} \times \mathbb{R}^d$ we have $Q(A) := Q_1(A) + Q_2(A)$ where $Q_1$ and $Q_2$ are independent signed random measures with total variation having Laplace transforms (for all measurable $A \subset \mathcal{E} \times \mathbb{R}^d$ and all $t \in \mathbb{R}$ for which the integrals in the expression converge)

$$E e^{t|Q_1|(A)} = \exp\Big\{2\alpha F(A)\int_{1/n}^{\infty}(e^{tx} - 1)x^{-1}e^{-\eta x}\,dx\Big\},$$ (18)

$$E e^{t|Q_2|(A)} = \exp\Big\{2\alpha F(A)\int_{0}^{1/n}(e^{tx} - 1)x^{-1}e^{-\eta x}\,dx\Big\}.$$ (19)
( 19 ) 
The random measures $Q_1$ and $Q_2$ are almost surely purely atomic; the magnitudes of the jumps of $Q_1$ are all $> n^{-1}$, whereas $Q_2$ has jump magnitudes all $\le n^{-1}$ (almost surely). Also, the number of jumps of $Q_1$ is distributed according to a Poisson law with intensity $\alpha E_1(n^{-1}/\eta)$, where $E_1$ is the exponential integral function. Recalling that $E_1(x) \asymp \gamma + \log(1/x)$ for $x$ small, it follows that $\alpha(\gamma + \log\eta) \le \alpha E_1(n^{-1}/\eta) \le 2\alpha\log n \ll Hn\epsilon_n^2/\log n$ when $n$ is large. Then using Chernoff's bound on the Poisson law, we get


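$$\begin{aligned}
\Pi\big(\#\{i : |u_i| > n^{-1}\} > Hn\epsilon_n^2/\log n\big) &\le e^{-\alpha E_1(n^{-1}/\eta)}\, \frac{(e\alpha E_1(n^{-1}/\eta))^{Hn\epsilon_n^2/\log n}}{(Hn\epsilon_n^2/\log n)^{Hn\epsilon_n^2/\log n}} \\
&\le (\eta e^{\gamma})^{-\alpha}\exp\Big\{-\frac{Hn\epsilon_n^2}{\log n}\Big[\log\frac{Hn\epsilon_n^2}{\log n} - \log(2e\alpha\log n)\Big]\Big\}.
\end{aligned}$$

But,

$$\log\frac{Hn\epsilon_n^2}{\log n} - \log(2e\alpha\log n) \ge (1 - \gamma_1)\log n - 2\log\log n + \log\frac{H}{2e\alpha},$$

which is in turn greater than $(1/2)(1 - \gamma_1)\log n$ when $n$ becomes large. This gives the proof for the first term of the rhs of equation (17).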
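
The Chernoff bound used for the first term can be checked numerically. The sketch below compares the exact upper tail of a Poisson variable with the bound $e^{-\lambda}(e\lambda/k)^k$, valid for $k > \lambda$; here `lam` plays the role of the Poisson intensity of the number of large jumps and `k` the role of the threshold $Hn\epsilon_n^2/\log n$. The function names and numerical values are illustrative assumptions, not quantities taken from the paper.

```python
import math


def poisson_upper_tail(lam, k, extra_terms=400):
    """P(N >= k) for N ~ Poisson(lam), summing the pmf from k upwards."""
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(N = k)
    total = 0.0
    for i in range(k, k + extra_terms):
        total += term
        term *= lam / (i + 1)  # P(N = i + 1) from P(N = i)
    return total


def chernoff_poisson_bound(lam, k):
    """Chernoff bound e^{-lam} (e lam / k)^k on P(N >= k), valid for k > lam."""
    if k <= lam:
        raise ValueError("the bound is only valid for k > lam")
    return math.exp(-lam + k * (1.0 + math.log(lam) - math.log(k)))


if __name__ == "__main__":
    lam = 3.0
    for k in (4, 8, 16, 32):
        print(k, poisson_upper_tail(lam, k), chernoff_poisson_bound(lam, k))
```

The bound decays like $\exp\{-k\log(k/(e\lambda))\}$, which is the mechanism producing the $\exp(-3n\epsilon_n^2)$ decay once the threshold is of order $n\epsilon_n^2/\log n$ and the intensity is of order $\log n$.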

Regarding the second term of the rhs of equation (17), it suffices to remark that the random variable $\sum_{i\ge1}|u_i|$ has Gamma distribution with parameters $(2\alpha, \eta)$. Then the upper bound on $\Pi(\sum_{i\ge1}|u_i| > n)$ follows from Markov's inequality. With the same argument, we have that the random variable $\sum_{i\ge1}|u_i|\, 1\{|u_i| \le n^{-1}\}$ is equal in distribution to $|Q_2|(\mathcal{E} \times \mathbb{R}^d)$; thus the bound for the fourth term of the rhs of equation (17) follows from Markov's inequality and equation (19), because


n e 


,3ne n \Q2\ > g 3ne 


< e 3ne " exp < 2 a 


(e nenX - l)x _1 e~ vx dx \ < e ~ 3ne " 


The fifth term of the rhs of equation (17) is bounded using Chebyshev's inequality. Indeed, with the same argument as before, the random variable $X := \sum_{i\ge1}|u_i|\, 1\{A_i \notin \mathcal{D}_n\}$ has Gamma distribution with parameters $(2\alpha F_A(\mathcal{D}_n^c), \eta)$. Hence for $n$ sufficiently large we have $EX = 2\alpha F_A(\mathcal{D}_n^c)/\eta \le \epsilon_n/2$, and

$$\Pi(X > \epsilon_n) \le \Pi(X - EX > \epsilon_n/2) \le \frac{8\alpha F_A(\mathcal{D}_n^c)}{\eta^2\epsilon_n^2}.$$

Then the result follows from equations (6) and (7). □

Lemma 4. Let $\epsilon_n \to 0$ with $n\epsilon_n^2 \to \infty$ and $K = 3(32 \vee 4\sigma_0^2)^{-1}$. Then there exists $M > 0$ such that it holds $\sum_j \sqrt{N(M\epsilon_n, \Theta_{n,j}, \rho_n)}\,\sqrt{\Pi(\Theta_{n,j})}\; e^{-(KM^2-2)n\epsilon_n^2} \to 0$.

Proof. Define the random measures $Q_1$ and $Q_2$ as in the proof of lemma 3. Then using the Poisson construction of $Q_1$ (see for instance Wolpert et al. (2011, section 2.3.1)), it follows from equation (9) that for any $j \in \{1,2,\dots\}^{H_n}$

$$\Pi(\Theta_{n,j}) \le \prod_{i\le H_n} F_A\big(A : \lambda_1(A)/\lambda_d(A) > n^{2^{j_i-1}}\big) \le b_6^{H_n}\, n^{-\kappa^*\sum_{i\le H_n}2^{j_i-1}}.$$

Moreover, using proposition 4 we can find a constant C > 0 independent of M such that 
N(Me n , ©n,j) Pn) < e -2 CHne 2 n <i(d~i )/2 J2i<H„ 2J * when n i s large. Therefore, 

We^) < exp {#ne 2 (<7 + } IW* n^" 1 )-^ 2 ^ 1 . 

For n large enough we have log66 < 2CTogn ; then provided n* > d(d — 1), we can sum over 
j £ {1, 2,.. .} Hn the last expression to get 

E j y/N(Me n , 0„j, p„) V n (0nj) < exp {2C7Ln€ 2 } (Efc>i ) 

< exp {iL(2C + ft*/2)ne 2 } . 

Now choose M > 0 satisfying KM 2 > 2 + H{2C + «*/2) to obtain the conclusion of the 
lemma. □ 

Proposition 4. For $n$ large enough there is a constant $C > 0$ independent of $M$ such that for any sequence $\epsilon_n \to 0$ with $n\epsilon_n^2 \to \infty$, the following holds for any $j \in \{1,2,\dots\}^{H_n}$:

$$\log N(M\epsilon_n, \Theta_{n,j}, \rho_n) \le CHn\epsilon_n^2 + \frac{d(d-1)}{2}\log n\sum_{i\le H_n}2^{j_i}.$$

Proof. The proof is based on arguments from Shen et al. (2013); it uses the fact that the covering number $N(M\epsilon_n, \Theta_n, \rho_n)$ is the minimal cardinality of an $M\epsilon_n$-net over $\Theta_n$ in the distance $\rho_n$. Let $\delta_n := M\epsilon_n n^{-(1+1/a_2)}$, $R_n$ be a $\delta_n$-net of $[-2S,2S]^d$, $\Delta_n$ be an $M\epsilon_n$-net of $\{(u_1,\dots,u_{H_n}) \in \mathbb{R}^{H_n} : \sum_{i=1}^{H_n}|u_i| \le n\}$ in the $\ell_1$-distance, and $S_n := \{\sigma > 0 : \sigma^2 = n^{-2/a_8}(1 + M\epsilon_n)^k,\ k \in \mathbb{N},\ k \le n\}$. Also, for any $k \ge 1$ let $O_k$ be an $n^{-(2^k+1)}M\epsilon_n$-net of the group of $d \times d$ orthogonal matrices equipped with the spectral norm $\|\cdot\|$, and define

$$\mathcal{D}_{n,k} := \big\{A \in \mathcal{D}_n : A = P\Lambda P^T,\ P \in O_k,\ \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_d),\ \lambda_j = n^{-1/a_2}(1 + M\epsilon_n/n)^{m},\ m \in \mathbb{N},\ m \le n^2,\ j = 1,\dots,d\big\}.$$

Pick $(f,\sigma) \in \Theta_{n,j}$ with $f(x) = \sum_{i\ge1}u_iK_{A_i}(x - \mu_i)$. Clearly we can find $\hat u \in \Delta_n$ such that $\sum_{i\le H_n}|u_i - \hat u_i| \le M\epsilon_n$, $\hat\mu \in R_n^{H_n}$ such that $|\mu_i - \hat\mu_i|_d \le \delta_n$ for all $i = 1,\dots,H_n$, and $\hat\sigma \in S_n$ such that $|\log\sigma - \log\hat\sigma| \le M\epsilon_n$. We also claim that we can find $\hat A_i \in \mathcal{D}_{n,j_i}$ such that $\|I - \hat A_i^{-1}A_i\| \le 3dM\epsilon_n/n$ for all $i \le H_n$. We defer the proof of the claim to later. Let $\hat f(x) = \sum_{i\le H_n}\hat u_iK_{\hat A_i}(x - \hat\mu_i)$ denote the function built from the parameters chosen as above; it follows

$$\begin{aligned}
\|f - \hat f\|_{2,n} &\le \sum_{i>H_n}|u_i| + \sum_{i\le H_n}|u_i - \hat u_i| + \sum_{i\le H_n}|u_i|\,\|K_{A_i}(\cdot - \mu_i) - K_{\hat A_i}(\cdot - \hat\mu_i)\|_{2,n} \\
&\le 2M\epsilon_n + C'\sum_{i\le H_n}|u_i|\,\|I - \hat A_i^{-1}A_i\| + C'\sum_{i\le H_n}|u_i|\,\|\hat A_i^{-1}\|\,|\mu_i - \hat\mu_i|_d \\
&\le M(2 + C' + 3C'd)\epsilon_n,
\end{aligned}$$

where the two last inequalities hold by proposition 12 for a constant $C' > 0$ depending only on $g$, and because $\|\hat A_i^{-1}\| \le n^{1/a_2}$ for all $i \le H_n$. Thus a $(2 + C' + 3C'd)M\epsilon_n$-net of $\Theta_{n,j}$ in the distance $\rho_n$ can be constructed with $(\hat f,\hat\sigma)$ as above. Recall that $\#R_n \lesssim (4S/\delta_n)^d$, $\#\Delta_n \lesssim (n/(M\epsilon_n))^{H_n}$, $\#S_n = n$ and $\#O_k \lesssim (n^{2^k+1}/(M\epsilon_n))^{d(d-1)/2}$. It turns out that $\#\mathcal{D}_{n,k} \le n^{2d} \times \#O_k$. Then the total number of $(\hat f,\hat\sigma)$ is bounded by a constant multiple of

$$n \times \Big(\frac{n}{M\epsilon_n}\Big)^{H_n} \times \Big(\frac{4S}{\delta_n}\Big)^{dH_n} \times \prod_{i\le H_n} n^{2d}\Big(\frac{n^{2^{j_i}+1}}{M\epsilon_n}\Big)^{d(d-1)/2}.$$

Finally, $H_n|\log M| \ll H_n\log n$ when $n$ is large, proving that the constant $C > 0$ can be chosen independent of $M$, and the constant factor $2 + C' + 3C'd$ can be absorbed into the bound.

It remains to prove that for any A E P n with Ai(^4)/Ad(A) < n 2 we can find A E T> n ,k 
such that || I — < 3 dMe n /n. Let A =: PAP T denote the spectral decomposition of 

A (recall that A is symmetric). Clearly, we can find a matrix A := PAP T in P n ,k with 
||P — P|| < n ~( 2fc + 1 ) Me n and 1 < Aj(A)/Aj(A) < 1 + Me n /n for all j = 1 ,...,d. Let 
A := PAP T and remark that 

||J - A _1 A|| < ||P - A _1 A|| + ||Ar 1 l||||/ - A _1 A|| 

< ||P - A" 1 !)! + ||P - 1 _1 A||(1 + ||P - A _1 A||). (20) 

Let B := P T P - I, so that ||P|| max < ||P|| < ||P T P - P|| < ||P - P|| < n-( 2 ‘ + %e„, and 
I - A' 1 A = P(B - A~ 1 BA)P t . It follows, 


I-A~ l A\\ < ||P 


A _1 PA|| < d||P|| max 


Ai(A) 
Ad (A) 


< dMe n /n , 


because the entries of B — A 1 PA are equal to Pjj(l — Aj/Ai) and || ■ || < d|| ■ || max . Moreover, 
/ — A " 1 A = P{I — A~ 1 A)P t implies ||P — A _1 A|| < dMe n /n. Then the conclusion follows 
from equation (20). □ 
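
The nets $S_n$ and $\mathcal{D}_{n,k}$ in the proof rely on geometric grids: on a grid with ratio $1 + \delta$, every point of the covered interval lies within $\log(1+\delta) \le \delta$ of a grid point in logarithmic distance. The following minimal sketch checks this property; the grid parameters are arbitrary illustrative values, and the helper names are hypothetical.

```python
import bisect
import math


def geometric_grid(lo, ratio, size):
    """Grid points lo * ratio^k for k = 0, ..., size."""
    return [lo * ratio ** k for k in range(size + 1)]


def nearest_below(grid, x):
    """Largest grid point not exceeding x (the first point if x is below the grid)."""
    i = bisect.bisect_right(grid, x) - 1
    return grid[max(i, 0)]


if __name__ == "__main__":
    delta = 0.05
    grid = geometric_grid(0.01, 1.0 + delta, 200)
    worst = 0.0
    for t in range(1000):
        # Sweep the covered interval on a logarithmic scale.
        x = grid[0] * (grid[-1] / grid[0]) ** (t / 999)
        worst = max(worst, abs(math.log(x) - math.log(nearest_below(grid, x))))
    print("worst log-distance:", worst, "bound:", math.log(1.0 + delta))
```

Because the approximation error is controlled in log-scale, a grid of polynomial size in $n$ suffices to cover a range of scales that is itself polynomial in $n$.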


8.2. Approximation of functions. In order to prove the prior positivity of Kullback-Leibler balls around $\theta_0$, we need to approximate $f_0 \in C^{\beta}[-S,S]^d$ by finite location-scale mixtures of kernels. We mostly follow the approach of de Jonge and van Zanten (2010, lemma 3.4).

Nevertheless, as mentioned in de Jonge and van Zanten (2010), we shall extend $f_0$ defined on $[-S,S]^d$ to a (smooth) function defined on $\mathbb{R}^d$ to be able to approximate $f_0$ properly; otherwise we could have trouble at the boundaries of $[-S,S]^d$. Clearly, without any precaution, $h^{-d}K_{hI} * f_0(x) \to f_0(x)/2$ as $h \to 0$ when $x$ belongs to the boundary of $[-S,S]^d$. de Jonge and van Zanten (2010) assume that the covariates are spread over $[a,b]^d$ with $a > -S$ and $b < S$ and extend $f_0$ by multiplying it by a smooth function that equals 1 on $[a,b]^d$ and 0 outside $[-S,S]^d$. Here we assume that the covariates are spread over $[-S,S]^d$ and we use Whitney's extension theorem (Whitney, 1934) to find a function $\tilde f_0 : \mathbb{R}^d \to \mathbb{R}$ such that $\tilde f_0 \in C^{\beta}(\mathbb{R}^d)$ and $D^{\alpha}\tilde f_0(x) = D^{\alpha}f_0(x)$ for all $x \in [-S,S]^d$ and all $|\alpha| \le \beta$. Then we apply the method of de Jonge and van Zanten (2010, lemma 3.4) to $\tilde f_0$. We find this approach more elegant since we do not have to assume that $f_0$ is defined on a larger set than the support of the covariates.

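For each $\alpha \in \mathbb{N}^d$, let $m_\alpha := h^{-d}\int x^{\alpha}K_{hI}(x)\,dx$. For $\alpha \in \mathbb{N}^d$ with $|\alpha| \ge 1$, define two sequences of numbers by the following recursion. If $|\alpha| = 1$ set $c_\alpha = 0$ and $d_\alpha = -1/\alpha!$, and for $|\alpha| \ge 2$ define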


Cql 


v (—1)H 

al 

l-\-k=a 


/ m/m 

V m a 



dk i 


da ■ — 


(-l)H 

a! 


Col* 


( 21 ) 


Given $\beta > 0$, $h > 0$ and $p$ the largest integer strictly smaller than $\beta$, define

$$f_\beta := \tilde f_0 - \sum_{1\le|\alpha|\le p} d_\alpha m_\alpha D^{\alpha}\tilde f_0.$$


Proposition 5. Let $h > 0$. For any $\beta > 0$ and any function $f_0 \in C^{\beta}[-S,S]^d$ there is a positive constant $M_\beta$ such that $|h^{-d}K_{hI} * f_\beta(x) - f_0(x)| \le M_\beta h^{\beta}$ for all $x \in [-S,S]^d$.

Proof. Noticing that $m_\alpha \lesssim h^{|\alpha|}$, the proof follows from the same argument as in Shen et al. (2013, lemma 2), because $\tilde f_0(x) = f_0(x)$ for all $x \in [-S,S]^d$. □


Proposition 5 shows that any sufficiently regular function can be approximated by continuous location mixtures of $K_{hI}$, provided $h$ is chosen small enough and $g$ has enough finite moments. In the sequel, we will need slightly more, namely approximating any $\beta$-Hölder continuous function by discrete mixtures of $K_{hI}$; this is done by discretizing the convolution operator in the next proposition. Compared to Ghosal and van der Vaart (2001, lemma 3.1), we need to take extra care regarding the fact that $f_0$ can take negative values, and also to control the "total mass" of the mixing measure.


Proposition 6. Let h > 0 be small enough and ( = 1 V2/(r- yr). There exists a discrete 
mixture f(x) = Ylf=\ a i^hl{x — gf) with N < h~ d (logh~ 1 ) d ^~ 1 \ pt G [—2S,2S] d for all 

i = 1_ ,N; such that \ f(x) — fo{x)\ < hP for all x G [— S, S] d . Moreover i l a *l ^5 d ~ d , 

and | pi - pj\ d > hd +1 for any i / j. 

Proof. Let Q be the signed measure defined by 4 G f A fp{y)dy for any measurable set 
A C M d . Let ALh '■= (C < o" 1 (/3 + d) log hr l ) l / T . To any j G T, d we associate the cube Bj := 
hMy,(j + [0, l] d ) and the signed measure Qj such that Qj(A) := Q(AnBj) for all measurable 
A C M. d . Let Qj~, Qj denote respectively the positive and negative part of the Jordan 
decomposition of Qj. It is a classical result from Tchakaloff (1957) that we can construct 
discrete (positive) measures P//,, P~ k each having at most (k + d)\/(k\d\) atoms and satisfying 
f R(x)Q^(dx) = f R(x)Pj lz (dx) for any polynomial R(x) of degree |a| < k. Let Ah := 
{j G T, d : |j| < 1 + S/(hMh)} and for any x G let N x := {j G A h ■ inf{|a: — y\ d : 
$y \in B_j\} \le hM_h\}$. For the signed measure $P_k := \sum_{j\in\Lambda_h}(P_{j,k}^+ - P_{j,k}^-)$ the total variation of $P_k$
satisfies the bound

i^i< Y, p tk + Y, p ik< E(^i+^) = M- 

jeA h jeA/, jez d 

Notice that \Q\ < +oo since we have fp G L 1 (M d ). Moreover, letting Pj k = P + fc — P~ k 


[ K h i{x - y)(Q - P k ){dy) = V [ g (Qj{dy) 

J mJ B ’ V * ; 

+ E / + E [ 9 (ErO (Qj ~ p i,k)(dy)- (22) 

3eK h \N x jB i V h ) j£N J B . V h J 

By assumptions on for any £ G [—S’, 5]^ the first term of the rhs of equation (22) is bounded 
by |Q|/i^ +rf . With the same argument, using the definition of N x , the second term of the rhs 
of equation (22) is bounded by 2\Q\h /3+d . Regarding the last term, using multivariate Taylor’s 
formula we write 



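\[
\int_{B_j} g\Big(\frac{x-y}{h}\Big)\,(Q_j - P_{j,k})(dy)
= \sum_{|\alpha| < k} \frac{D^\alpha g(0)}{\alpha!} \int_{B_j} \Big(\frac{x-y}{h}\Big)^{\alpha} (Q_j - P_{j,k})(dy)
+ \int_{B_j} R_k\Big(\frac{x-y}{h}\Big)\,(Q_j - P_{j,k})(dy), \tag{23}
\]

where $|R_k(x)| \le \sup_{|\alpha| = k} |D^\alpha g(0)|\,|x|_d^k/k!$. The first term of the rhs of equation (23) vanishes
by construction of $P_{j,k}$. For any $j \in N_x$ and any $y \in B_j$ it holds $|x - y|_d \le 2hM_h$; then
using Stirling's formula and assumptions on $D^\alpha g$ the second term of the rhs of equation (23)
is bounded by

\[
\sup_{|\alpha| = k} |D^\alpha g(0)|\,\frac{(2M_h)^k}{\sqrt{2\pi k}\,k^k e^{-k}} \int_{B_j} |Q_j - P_{j,k}|(dy) \le K_1 \exp\{-k(1-\gamma)\log k + k\log(2eM_h)\},
\]

whenever $j \in N_x$, for a constant $K_1$ depending only on $f_0$, $\beta$ and $g$. Therefore, choosing
$k \ge (2eM_h)^{2/(1-\gamma)}$, we deduce from equations (22) and (23) that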


\[
\Big| \int K_{hI}(x - y)\,(Q - P_k)(dy) \Big| \le 3|Q|h^{\beta+d} + K_1 \exp\Big\{-\frac{1-\gamma}{2}\,k\log k\Big\}. \tag{24}
\]


Now if $(2eM_h)^{2/(1-\gamma)} \ge 2(\beta + d)/(1-\gamma)\log h^{-1}$ set $k$ to be the smallest integer larger than
$(2eM_h)^{2/(1-\gamma)}$; otherwise set $k$ to be the smallest integer larger than $2(\beta + d)/(1-\gamma)\log h^{-1}$.
This yields the first part of the proposition with $f(x) = h^{-d}\int K_{hI}(x - y)\,P_k(dy)$ because of
equation (24), of proposition 5 and because each of the $P_{j,k}$ has a number of atoms proportional
to $(\log h^{-1})^{d\zeta}$ by Tchakaloff's theorem, all in $[-2S,2S]^d$ if $h$ is small enough.
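
The role of the two lower bounds on $k$ can be made explicit. The following is only a sketch of the arithmetic, assuming the Stirling bound has the exponent $-k(1-\gamma)\log k + k\log(2eM_h)$ used above. If $k \ge (2eM_h)^{2/(1-\gamma)}$ then $\log(2eM_h) \le \frac{1-\gamma}{2}\log k$, hence

\[
-k(1-\gamma)\log k + k\log(2eM_h) \le -\frac{1-\gamma}{2}\,k\log k .
\]

If moreover $k \ge 2(\beta+d)/(1-\gamma)\log h^{-1}$ and $h$ is small enough that $\log k \ge 1$, then

\[
\exp\Big\{-\frac{1-\gamma}{2}\,k\log k\Big\} \le \exp\Big\{-\frac{1-\gamma}{2}\,k\Big\} \le \exp\{-(\beta+d)\log h^{-1}\} = h^{\beta+d},
\]

so that the exponential term in equation (24) is of the same order as the first term.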

It remains to prove the separation between the atoms of $P_k$. But the cost to the supremum
norm of moving one $\mu_i$ by $h^{\beta+1}$ is proportional to $h^\beta$ by proposition 12. Hence we can assume
that the support points of $P_k$ are chosen on a regular grid with $h^{\beta+1}$ separation between nodes
(see also Shen et al. (2013, corollary B1)). □
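
The moment-matching step borrowed from Tchakaloff's theorem can be illustrated in dimension one. The sketch below is not the construction of the proof: it takes, as a hypothetical example, the Lebesgue measure on $[-1,1]$ and checks that the two-point Gauss-Legendre rule reproduces all polynomial moments up to degree 3, using fewer atoms than the general bound $(k+d)!/(k!\,d!)$. The function names are illustrative only.

```python
from math import comb, sqrt, isclose


def tchakaloff_atom_bound(k: int, d: int) -> int:
    """Bound (k+d)!/(k! d!) on the number of atoms, i.e. the dimension of the
    space of polynomials of degree at most k in d variables."""
    return comb(k + d, d)


def moment(atoms, weights, p: int) -> float:
    """p-th moment of the discrete measure sum_i weights[i] * delta_{atoms[i]}."""
    return sum(w * a ** p for a, w in zip(atoms, weights))


def lebesgue_moment(p: int) -> float:
    """Integral of x**p over [-1, 1]."""
    return 0.0 if p % 2 else 2.0 / (p + 1)


# Hypothetical example: two-point Gauss-Legendre rule on [-1, 1].
atoms = [-1.0 / sqrt(3.0), 1.0 / sqrt(3.0)]
weights = [1.0, 1.0]

# The discrete measure matches the Lebesgue moments of degree 0, 1, 2, 3.
for p in range(4):
    assert isclose(moment(atoms, weights, p), lebesgue_moment(p), abs_tol=1e-12)

# Two atoms are used, within the general bound for k = 3, d = 1.
assert len(atoms) <= tchakaloff_atom_bound(3, 1)
```

In the proof the same idea is applied on each cube $B_j$, separately to the positive and negative parts of $Q_j$, which is why the number of atoms per cube grows like $k^d$.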


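8.3. Kullback-Leibler property. A simple computation shows that (see for instance Choi
and Schervish (2007)) for $\theta_0 = (f_0, \sigma_0)$ and $\theta = (f, \sigma)$,

\[
K_i(\theta_0, \theta) = \log\frac{\sigma}{\sigma_0} - \frac{1}{2}\Big(1 - \frac{\sigma_0^2}{\sigma^2}\Big) + \frac{1}{2}\,\frac{|f_0(x_i) - f(x_i)|^2}{\sigma^2},
\]

\[
V_{2,i}(\theta_0, \theta) = \frac{1}{2}\Big(1 - \frac{\sigma_0^2}{\sigma^2}\Big)^2 + \frac{\sigma_0^2}{\sigma^4}\,|f_0(x_i) - f(x_i)|^2 .
\]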


Therefore, for all $0 < \varepsilon < 1/2$, there exists a constant $C_0 > 0$ (depending only on $\theta_0$) such
that one has the inclusion

\[
K_n(\theta_0, \varepsilon) \supseteq \{(f, \sigma) : \|f - f_0\|_\infty^2 \le C_0\varepsilon^2,\ \sigma_0 \le \sigma \le \sigma_0(1 + C_0\varepsilon^2)\}, \tag{25}
\]

hence probabilities of Kullback-Leibler balls around $\theta_0$ are lower bounded by the probability
of the sets defined in the rhs of equation (25). Now we state and prove the main result of this
section.
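
The closed forms behind equation (25) are the Kullback-Leibler divergence and the variance of the log-likelihood ratio between $N(f_0(x_i), \sigma_0^2)$ and $N(f(x_i), \sigma^2)$. The following sketch evaluates them and compares with a plain Riemann-sum approximation; the function names and the numerical values are illustrative assumptions, not part of the paper.

```python
import math


def kl_and_var(f0: float, s0: float, f: float, s: float):
    """Closed forms for the Kullback-Leibler divergence K and the variance V of
    the log-likelihood ratio between N(f0, s0^2) and N(f, s^2)."""
    r = s0 ** 2 / s ** 2
    delta2 = (f0 - f) ** 2
    k = math.log(s / s0) - 0.5 * (1.0 - r) + 0.5 * delta2 / s ** 2
    v = 0.5 * (1.0 - r) ** 2 + r * delta2 / s ** 2
    return k, v


def numeric_kl_and_var(f0, s0, f, s, n=20001, width=12.0):
    """Riemann-sum approximation of the same two quantities under N(f0, s0^2)."""
    h = 2.0 * width * s0 / (n - 1)
    m1 = 0.0
    m2 = 0.0
    for i in range(n):
        y = f0 - width * s0 + i * h
        p0 = math.exp(-0.5 * ((y - f0) / s0) ** 2) / (s0 * math.sqrt(2.0 * math.pi))
        # log of the density ratio p_{theta_0}(y) / p_theta(y)
        llr = math.log(s / s0) - 0.5 * ((y - f0) / s0) ** 2 + 0.5 * ((y - f) / s) ** 2
        m1 += llr * p0 * h
        m2 += llr ** 2 * p0 * h
    return m1, m2 - m1 ** 2
```

Both quantities vanish when $\theta = \theta_0$ and are of order $|f_0 - f|^2 + (\sigma - \sigma_0)$ when $\sigma \ge \sigma_0$ is close to $\sigma_0$, which is what the inclusion (25) exploits.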


Lemma 5. Let $f_0 \in C^\beta[-S,S]^d$, and $\zeta \ge 1$ as in proposition 6. Then there exists a constant
$C > 0$, not depending on $n$, such that $\Pi(K_n(\theta_0, \varepsilon_n)) \ge \exp(-n\varepsilon_n^2)$ for
\[
\varepsilon_n^2 = C\,n^{-2\beta/(2\beta+d+\kappa/2)}(\log n)^{2\beta d(\zeta-1)/(2\beta+d+\kappa/2)}.
\]

Proof. By proposition 6, for any $h > 0$ sufficiently small there is $N \lesssim h^{-d}(\log h^{-1})^{d(\zeta-1)}$ and
a function $f_h(x) = \sum_{j=1}^N \alpha_{h,j} K_{hI}(x - \mu_j)$ such that $|f_h(x) - f_0(x)| \lesssim h^\beta$ for all $x \in [-S,S]^d$,
with $\alpha_{h,j} \in \mathbb{R}$ and $\mu_j \in [-2S,2S]^d$ for all $j = 1,\dots,N$, and $|\mu_i - \mu_j|_d \ge h^{\beta+1}$
whenever $i \ne j$. Define

\[
\mathcal{E}_{s,h} := \{A \in \mathcal{E}_s : h^{-1} \le \lambda_i(A^{-1}) \le h^{-1}(1 + h^{\beta+d}),\ i = 1,\dots,d\}.
\]

We construct a partition of $\mathcal{E}_s \times [-2S,2S]^d$ in the following way: for all $j = 1,\dots,N$, let $U_j$
be the closed ball of radius $h^{\beta+d+1}$ centered at $\mu_j$ (observe that these balls are disjoint), and
set $V_j := \mathcal{E}_{s,h} \times U_j$, $V^c := \mathcal{E}_s \times [-2S,2S]^d \setminus \bigcup_{j=1}^N V_j$. Let $\mathcal{Q}$ denote the set of signed measures $Q$
on $\mathcal{E}_s \times [-2S,2S]^d$ satisfying $|Q(V_j) - \alpha_{h,j}| \le h^\beta N^{-1}$ for all $j = 1,\dots,N$, and
$|Q|(V^c) \le h^\beta$. Notice that for any $Q \in \mathcal{Q}$ we have $|Q| \le \sum_{j=1}^N |Q(V_j) - \alpha_{h,j}| + |\alpha_{h,j}| \lesssim
h^\beta + h^{-d} \lesssim h^{-d}$ because of proposition 6. Then for any $Q \in \mathcal{Q}$ and all $x \in [-S,S]^d$, using
proposition 12,


\[
\Big| \int_{\mathcal{E}_s \times [-2S,2S]^d} K_A(x - \mu)\,Q(dA\,d\mu) - f_h(x) \Big|
\le \|g\|_\infty\,|Q|(V^c) + \|g\|_\infty \sum_{j=1}^N |Q(V_j) - \alpha_{h,j}|
+ \sum_{j=1}^N \int_{V_j} |K_A(x - \mu) - K_{hI}(x - \mu_j)|\,|Q|(dA\,d\mu) \lesssim h^\beta .
\]

Thus for all $Q \in \mathcal{Q}$ and all $x \in [-S,S]^d$, we have $|\int K_A(x - \mu)\,Q(dA\,d\mu) - f_0(x)| \le |\int K_A(x - \mu)\,Q(dA\,d\mu) - f_h(x)| + |f_h(x) - f_0(x)| \le K_1 h^\beta$ for a constant $K_1 > 0$ not depending on $h$. By
the assumptions of equations (5) and (8) we have for any $j = 1,\dots,N$

\[
\alpha F_A(\mathcal{E}_{s,h})\,F_\mu(U_j) \ge \alpha b_1 b_4\,h^{a_1(\beta+d+1) - a_4 + a_5(\beta+d)} \exp(-C_3 h^{-\kappa/2})
=: K_2 h^q \exp(-C_3 h^{-\kappa/2}),
\]

where $q := a_1(\beta + d + 1) - a_4 + a_5(\beta + d)$ and the constant $K_2 > 0$ does not depend on $h$.

For $h > 0$ sufficiently small, it is clear that $K_2 h^q \exp(-C_3 h^{-\kappa/2}) \le F(V_j) \le 1$ for all $j =
1,\dots,N$. We also assume without loss of generality that $K_2 h^q \exp(-C_3 h^{-\kappa/2}) \le F(V^c) \le 1$
and we set $V_{N+1} := V^c$, $\alpha_{h,N+1} := 0$; otherwise we subdivide $V^c$ into smaller subsets for
which the relation is verified. Because $F$ is a probability measure, this can be done with a finite
number of subsets not depending on $h$. Now let $W := \{\sigma > 0 : \sigma_0 \le \sigma \le \sigma_0(1 + C_0\varepsilon_n^2)\}$ and
e n = . Notice that P a (W) > K 3 e^ 19 with a constant A 3 > 0 eventually depending 

on $\theta_0$. The sets $\mathcal{E}_{s,h} \times U_j$ are disjoint, hence by equation (12) and proposition 11 we deduce
that there is a constant $K_4 > 0$ such that

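\[
\Pi(K_n(\theta_0, \varepsilon_n)) \ge P_\sigma(W) \prod_{j=1}^{N+1} \frac{C\,h^\beta N^{-1} e^{-(3+\eta)|\alpha_{h,j}|}}{\Gamma(\alpha F(V_j))}
\ge \exp\big\{-K_4 h^{-(d+\kappa/2)} (\log h^{-1})^{d(\zeta-1)}\big\},
\]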

where we used that $N \lesssim h^{-d}(\log h^{-1})^{d(\zeta-1)}$, $\sum_{j=1}^N |\alpha_{h,j}| \lesssim h^{-d}$ and $\Gamma(x) \le x^{-1}$ for $x > 0$
sufficiently small. This concludes the proof. □
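
The rate in the statement of lemma 5 comes from balancing the exponent of the lower bound against $n\varepsilon_n^2$. As a sketch only, ignoring constants and assuming $\varepsilon_n \asymp h^\beta$ as in the proof, one solves

\[
n h^{2\beta} \asymp h^{-(d+\kappa/2)} (\log h^{-1})^{d(\zeta-1)}
\quad\Longleftrightarrow\quad
h^{2\beta+d+\kappa/2} \asymp n^{-1} (\log h^{-1})^{d(\zeta-1)},
\]

so that $\log h^{-1} \asymp \log n$ and

\[
\varepsilon_n^2 \asymp h^{2\beta} \asymp n^{-2\beta/(2\beta+d+\kappa/2)} (\log n)^{2\beta d(\zeta-1)/(2\beta+d+\kappa/2)} .
\]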


9. Proof of theorem 4 

As in section 5, the proof of theorem 4 consists in verifying the conditions established in
theorem 3.


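9.1. Sieve construction. For constants $H, M > 0$ to be determined later, we define

\[
\Theta_n := \left\{ (f, \sigma) :
\begin{array}{l}
f(x) = \int K_{\xi,\phi}(x - \mu)\,Q(d\xi\,d\mu\,d\phi), \quad \operatorname{supp} Q = \mathbb{R}^d \times [-2S,2S]^d \times [0,\pi/2], \\
Q = \sum_{i=1}^\infty u_i \delta_{\xi_i,\mu_i,\phi_i}, \quad n^{-2/a_8} \le \sigma^2 \le n^{-2/a_8}(1 + M\varepsilon_n)^n, \\
\sum_{i=1}^\infty |u_i| \le n, \quad \#\{i : |u_i| > n^{-1},\ |\xi_i|_d \le e^{2n\varepsilon_n^2}\} \le Hn\varepsilon_n^2/\log n, \\
\sum_{i=1}^\infty |u_i| \mathbf{1}\{|\xi_i|_d > e^{2n\varepsilon_n^2}\} \le M\varepsilon_n, \quad \sum_{i=1}^\infty |u_i| \mathbf{1}\{|u_i| \le n^{-1}\} \le M\varepsilon_n
\end{array}
\right\}.
\]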

In the sequel, we assume without loss of generality that the jumps of $Q$ in the definition
of $\Theta_n$ are ordered so that there is no jump with $|u_i| > n^{-1}$ and $|\xi_i|_d \le e^{2n\varepsilon_n^2}$ when $i >
Hn\varepsilon_n^2/\log n$. Moreover, we consider the following partition of $\Theta_n$. Let $H_n$ be the largest
integer smaller than $Hn\varepsilon_n^2/\log n$. Then for any $j = (j_1,\dots,j_{H_n}) \in \{1,2,\dots\}^{H_n}$ we define the
slices

\[
\Theta_{n,j} := \{(f, \sigma) \in \Theta_n : \sqrt{n}(j_i - 1) \le |\xi_i|_d \le \sqrt{n}\,j_i,\ \forall i \le H_n\}.
\]

Lemma 6. Assume that there is $0 < \gamma_1 < 1$ such that $\varepsilon_n^2 \ge n^{-\gamma_1}$ for all $n$ large enough. Then
for $H = 6(1 - \gamma_1)^{-1}$ it holds $\Pi(\Theta_n^c) \le \exp(-3n\varepsilon_n^2)$ as $n \to \infty$.

Proof. According to the proof of lemma 3, the result holds if $F_\xi(\xi : |\xi|_d > e^{2n\varepsilon_n^2}) \le
\varepsilon_n^2 \exp(-3n\varepsilon_n^2)$ for $n$ sufficiently large. Then the conclusion follows from equation (13)
because $\eta > 0$. □

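Lemma 7. Let $\varepsilon_n \to 0$ with $n\varepsilon_n^2 \to \infty$ and $K = 3(32 \vee 4\sigma_0)$. Then there exists $M > 0$ such
that it holds $\sum_j \sqrt{N(M\varepsilon_n, \Theta_{n,j}, \rho_n)}\,\sqrt{\Pi(\Theta_{n,j})}\,e^{-(KM^2 - 2)n\varepsilon_n^2} \to 0$.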

Proof. With the same argument as in lemma 4, it follows from equation (13) that for any 

je{ i,2 ,...} h " 

n(©nj) < n t <* n m ■■ leu > v^Ui -^ rwj 1 * + - i)r 2(??+1) - 

Moreover, using proposition 7 we can find a constant $C > 0$ independent of $M$ such that
$N(M\varepsilon_n, \Theta_{n,j}, \rho_n) \le \exp(2CHn\varepsilon_n^2) \prod_{i \le H_n} j_i^{d-1}$ when $n$ is large. Therefore, for those $n$

y 7 N(Me n , ©n,j, Pn)\/n(0 rl j) 

< exp jtfne 2 ^<7 + } II jj d_1)/2 [l + yfn(ji - 1)]~0 +1) . 

For n large enough we have log&n < 2(7logn ; then provided r\ > (d— l)/2, we can sum over 
j 6 {1,2 ,...} ” the last expression to get 

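\[
\sum_j \sqrt{N(M\varepsilon_n, \Theta_{n,j}, \rho_n)}\,\sqrt{\Pi(\Theta_{n,j})}
\le \exp\{2CHn\varepsilon_n^2\} \Big( \sum_{k \ge 1} k^{(d-1)/2} [1 + \sqrt{n}(k-1)]^{-(\eta+1)} \Big)^{H_n}
\]
\[
\le \exp\{2CHn\varepsilon_n^2\} \Big( 1 + n^{-(\eta+1)/2} \sum_{k \ge 1} k^{(d-1)/2 - (\eta+1)} \Big)^{H_n} \le \exp\{3CHn\varepsilon_n^2\},
\]

where the last inequality holds for $n$ sufficiently large. Now choose $M > 0$ satisfying $KM^2 >
2 + 3CH$ to obtain the conclusion of the lemma. □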

Proposition 7. For $n$ large enough there is a constant $C > 0$ independent of $M$ such that
for any sequence $\varepsilon_n \to 0$ with $n\varepsilon_n^2 \to \infty$, the following holds for any $j \in \{1,2,\dots\}^{H_n}$:

\[
\log N(M\varepsilon_n, \Theta_{n,j}, \rho_n) \le CHn\varepsilon_n^2 + (d - 1) \sum_{i \le H_n} \log j_i .
\]

Proof. The proof is similar to proposition 4. Let $R_n$ be a $(M\varepsilon_n/n)$-net of $[-2S,2S]^d$, $A_n$ be
a $M\varepsilon_n$-net of $\{(u_1,\dots,u_{H_n}) \in \mathbb{R}^{H_n} : \sum_{i=1}^{H_n} |u_i| \le n\}$ in the $\ell_1$-distance, $S_n := \{\sigma > 0 :
\sigma^2 = n^{-2/a_8}(1 + M\varepsilon_n)^k,\ k \in \mathbb{N},\ k \le n\}$, $U_n$ be a $(M\varepsilon_n/n)$-net of $[0,\pi/2]$, and for all
$k = 1,\dots,H_n$, let $V_{n,k}$ be a $(M\varepsilon_n/n)$-net of $\{\xi \in \mathbb{R}^d : \sqrt{n}(k-1) \le |\xi|_d \le \sqrt{n}k\}$. Pick
$(f, \sigma) \in \Theta_{n,j}$ with $f(x) = \sum_{i=1}^\infty u_i K_{\xi_i,\phi_i}(x - \mu_i)$. Clearly we can find $\hat u \in A_n$ such that
$\sum_{i \le H_n} |u_i - \hat u_i| \le M\varepsilon_n$, $\hat\mu \in R_n^{H_n}$ such that $|\mu_i - \hat\mu_i|_d \le M\varepsilon_n/n$ for all $i = 1,\dots,H_n$, $\hat\phi \in U_n^{H_n}$
such that $|\phi_i - \hat\phi_i| \le M\varepsilon_n/n$ for all $i = 1,\dots,H_n$, $\hat\xi_i \in V_{n,j_i}$ such that $|\xi_i - \hat\xi_i|_d \le M\varepsilon_n/n$ for all
$i = 1,\dots,H_n$, and $\hat\sigma \in S_n$ such that $|\log\sigma - \log\hat\sigma| \le M\varepsilon_n$. Let $\hat f(x) = \sum_{i \le H_n} \hat u_i K_{\hat\xi_i,\hat\phi_i}(x - \hat\mu_i)$
denote the function built from the parameters chosen as above; it follows

\[
\|f - \hat f\|_{2,n} \le \sum_{i > H_n} |u_i| + \sum_{i \le H_n} |u_i - \hat u_i| + \sum_{i \le H_n} |u_i|\,\|K_{\xi_i,\phi_i}(\cdot - \mu_i) - K_{\hat\xi_i,\hat\phi_i}(\cdot - \hat\mu_i)\|_{2,n}
\]
\[
\le 2M\varepsilon_n + C' \sum_{i \le H_n} |u_i|\,|\xi_i - \hat\xi_i|_d + C' \sum_{i \le H_n} |u_i|\,|\mu_i - \hat\mu_i|_d + C' \sum_{i \le H_n} |u_i|\,|\phi_i - \hat\phi_i|
\le 2M(1 + 3C')\varepsilon_n,
\]

for a constant $C' > 0$ depending only on $g$, because of proposition 12. Thus a $2(1 +
3C')M\varepsilon_n$-net of $\Theta_{n,j}$ in the distance $\rho_n$ can be constructed with $(\hat f, \hat\sigma)$ as above. Recall
that $\#R_n \le (4Sn/(M\varepsilon_n))^d$, $\#A_n \le (n/(M\varepsilon_n))^{H_n}$, $\#S_n = n$, $\#U_n \le \pi n/(2M\varepsilon_n)$ and
$\#V_{n,k} \le (n^{3/2}k/(M\varepsilon_n) + 1)^d - (n^{3/2}(k-1)/(M\varepsilon_n) - 1)^d \lesssim (n^{3/2}/(M\varepsilon_n))^d k^{d-1}$, where we used
$u^d - v^d \le d(u - v)u^{d-1}$ for $u > v$. Then the end of the proof is identical to proposition 4. □
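
The elementary inequality $u^d - v^d \le d(u - v)u^{d-1}$ used to count the points of $V_{n,k}$ follows from the mean value theorem applied to $t \mapsto t^d$ on $[v, u]$. A small numerical sanity check, with an illustrative function name and restricted to $0 \le v \le u$, could look as follows.

```python
def mean_value_bound_holds(u: float, v: float, d: int) -> bool:
    """Return True if u**d - v**d <= d * (u - v) * u**(d - 1), for 0 <= v <= u.

    A tiny relative and absolute slack absorbs floating-point rounding.
    """
    lhs = u ** d - v ** d
    rhs = d * (u - v) * u ** (d - 1)
    return lhs <= rhs * (1.0 + 1e-12) + 1e-12


# Check the inequality on a small grid of radii and dimensions.
grid = [0.0, 0.5, 1.0, 2.5, 10.0, 123.4]
all_ok = all(
    mean_value_bound_holds(u, v, d)
    for d in range(1, 7)
    for u in grid
    for v in grid
    if v <= u
)
```

With $u - v$ of order $n^{3/2}/(M\varepsilon_n)$ and $u$ of order $n^{3/2}k/(M\varepsilon_n)$, this gives the factor $k^{d-1}$ in the bound on $\#V_{n,k}$.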

9.2. Approximation of functions. Let $\xi > 0$ and $m, r \ge 1$ be two positive integers. Define
the approximating kernel $L^\xi_{m,r} : \mathbb{R}^d \to \mathbb{R}$ by the expression

\[
L^\xi_{m,r}(x) := A_{m,r}\,g(x) \prod_{i=1}^d \frac{\sin^{2r}(m\xi x_i)}{\sin^{2r}(\xi x_i)},
\]

where $A_{m,r} > 0$ is chosen so that $\int_{\mathbb{R}^d} L^\xi_{m,r}(x)\,dx = 1$. Also let $f_0$ denote a suitable Whitney
extension of $f_0$ from $[-S,S]^d$ to $\mathbb{R}^d$ (see the proof of proposition 5). We may assume that
$f_0$ and all its derivatives (up to order $\beta$) are zero outside $[-2S,2S]^d$. If it is not the case, it
suffices to multiply $f_0$ by a smooth function that equals 1 on $[-S,S]^d$ and 0 outside $[-2S,2S]^d$
(for instance, think about the convolution of a bump function with a proper indicator set
function).

In order to achieve a good order of approximation of $f_0$ when $\beta$ is large, we construct a
transformation of $f_0$ as follows. In the sequel we let $p$ be the largest integer strictly smaller
than $\beta$. For all multi-index $\alpha \in \mathbb{N}^d$, we define $m_\alpha^{m,r,\xi} := \int_{\mathbb{R}^d} x^\alpha L^\xi_{m,r}(x)\,dx$. By definition of
$L^\xi_{m,r}$, the $m_\alpha^{m,r,\xi}$'s are always finite. Then we define

\[
f_\beta := f_0 - \sum_{1 \le |\alpha| \le p} d_\alpha\,m_\alpha^{m,r,\xi}\,D^\alpha f_0,
\]

where the coefficients $(d_\alpha)$ are defined in the same fashion as equation (21), with obvious
modifications.

Proposition 8. Let $m, r \ge 1$ be integers. For any $\beta > 0$ and any function $f_0 \in C^\beta[-S,S]^d$
there is a constant $M_\beta > 0$ such that $|L^\xi_{m,r} * f_\beta(x) - f_0(x)| \le M_\beta(\log m/m)^\beta$ for all $x \in
[-S,S]^d$ if $2r > p + 1$ and $\xi = K_0(\log m)^{-1}$ for a constant $K_0$ depending only on $g$, $\beta$ and $r$.

Proof. First assume $0 < \beta \le 1$. By assumptions on $f_0$, there is $M > 0$ such that for all
$x, y \in \mathbb{R}^d$ we have $|f_0(x) - f_0(y)| \le M|x - y|_d^\beta$. Then,

\[
|f_0(x) - L^\xi_{m,r} * f_0(x)| \le \int |f_0(x) - f_0(y)|\,|L^\xi_{m,r}(x - y)|\,dy \le M \int |x - y|_d^\beta\,|L^\xi_{m,r}(x - y)|\,dy .
\]


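Remark that for any $\tau > 0$ and all $u \in \mathbb{R}^d$ we have $d^{-1}\sum_{i=1}^d |u_i|^\tau \le \max_{i=1,\dots,d} |u_i|^\tau \le
(\sum_{i=1}^d |u_i|^2)^{\tau/2}$ and $|x - y|_d^\beta = (\sum_{i=1}^d |x_i - y_i|^2)^{\beta/2} \le d^{\beta/2} \max_{i=1,\dots,d} |x_i - y_i|^\beta \le d^{\beta/2} \sum_{i=1}^d |x_i - y_i|^\beta$.
Then, because $|g(x)| \lesssim \exp(-C_0|x|_d^\tau)$,

\[
|f_0(x) - L^\xi_{m,r} * f_0(x)| \lesssim M d^{\beta/2} A_{m,r} \sum_{i=1}^d \int_{\mathbb{R}^d} |u_i|^\beta \exp\Big(-C_0 d^{-1} \sum_{j=1}^d |u_j|^\tau\Big) \prod_{j=1}^d \frac{\sin^{2r}(m\xi u_j)}{\sin^{2r}(\xi u_j)}\,du
\]
\[
= M d^{1+\beta/2} A_{m,r} \Big( \int_{\mathbb{R}} |u|^\beta e^{-C_0|u|^\tau/d}\,\frac{\sin^{2r}(m\xi u)}{\sin^{2r}(\xi u)}\,du \Big) \Big( \int_{\mathbb{R}} e^{-C_0|u|^\tau/d}\,\frac{\sin^{2r}(m\xi u)}{\sin^{2r}(\xi u)}\,du \Big)^{d-1}. \tag{26}
\]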

We now bound the first integral of the rhs of equation (26). Let us split the domain into
three parts: $D_1 := (-1/(\xi m), 1/(\xi m))$, $D_2 := [-\pi/\xi, -1/(\xi m)] \cup [1/(\xi m), \pi/\xi]$ and $D_3 :=
\mathbb{R} \setminus (D_1 \cup D_2)$. On $D_1$ and $D_3$ we always have $\sin^2(m\xi u)/\sin^2(\xi u) \le m^2$, whereas on $D_2$ it
holds $\sin^2(m\xi u)/\sin^2(\xi u) \le 1/(\xi u)^2$. Therefore,

\[
\int_{\mathbb{R}} |u|^\beta e^{-C_0|u|^\tau/d}\,\frac{\sin^{2r}(m\xi u)}{\sin^{2r}(\xi u)}\,du
\le m^{2r} \int_{D_1} |u|^\beta\,du + \xi^{-2r} \int_{D_2} |u|^{\beta - 2r}\,du + m^{2r} \int_{D_3} |u|^\beta e^{-C_0|u|^\tau/d}\,du =: I_1 + I_2 + I_3 .
\]

The bounds $I_1 \lesssim m^{-\beta+(2r-1)}\xi^{-(\beta+1)}$ and $I_2 \lesssim \xi^{-(\beta+1)}(1 + m^{-\beta+(2r-1)})$ are obvious. Now we
bound $I_3$. By Markov's inequality, for any $t < C_0/d$, we have

\[
\int_{\pi/\xi}^\infty u^\beta \exp(-C_0 u^\tau/d)\,du \le e^{-\pi t/\xi} \int_0^\infty u^\beta \exp(-C_0 u^\tau/d + ut)\,du .
\]

Now it is clear that $I_3 \lesssim m^{2r} \exp(-\pi t/\xi)$ since by assumption $\tau \ge 1$ and we can choose
$t < C_0/d$. It follows $I_3 \lesssim \xi^{-(\beta+1)}$ if $\xi = K_0(\log m)^{-1}$ for a suitable constant $K_0 > 0$
depending only on $g$, $\beta$ and $r$. The same reasoning applies to the second integral of the rhs
of equation (26), yielding the bound

\[
|f_0(x) - L^\xi_{m,r} * f_0(x)| \lesssim A_{m,r}\,m^{-\beta + d(2r-1)} (\log m)^{\beta+d}, \tag{27}
\]

whenever $\xi = K_0(\log m)^{-1}$. Hence, it remains to bound $A_{m,r}$. By assumption, we have
$g(x) \ge 0$ for all $x \in \mathbb{R}^d$ and a constant $C > 0$ such that $g(x) \ge C$ on a set $E \subseteq [-\pi,\pi]^d$; thus

\[
\frac{1}{A_{m,r}} \ge \int_E g(x) \prod_{i=1}^d \frac{\sin^{2r}(m\xi x_i)}{\sin^{2r}(\xi x_i)}\,dx
\ge C m^{2rd} \int_E \prod_{i=1}^d \frac{\sin^{2r}(m\xi x_i)}{(m\xi x_i)^{2r}}\,dx
= \frac{C m^{d(2r-1)}}{\xi^d} \int_{E'} \prod_{i=1}^d \frac{\sin^{2r}(x_i)}{x_i^{2r}}\,dx
\gtrsim \frac{m^{d(2r-1)}}{\xi^d},
\]

where $E' := \{m\xi x : x \in E\}$ has non-null Lebesgue measure by assumption. Combining the
last result with equation (27), we get the estimate $|f_0(x) - L^\xi_{m,r} * f_0(x)| \lesssim m^{-\beta}(\log m)^\beta$ for
all $x \in [-S,S]^d$, provided $\xi \le K_0(\log m)^{-1}$.

Now assume that $\beta > 1$. Acting as in the previous paragraph, we can have $|m_\alpha^{m,r,\xi}| \lesssim
m^{-|\alpha|}(\log m)^{|\alpha|}$ for all $|\alpha| \le p$, provided $2r > p + 1$ and $\xi = K_0(\log m)^{-1}$ for a suitable
constant $K_0 > 0$. Then the proof is identical to Shen et al. (2013, lemma 2). □

Proposition 9. Let $m \ge 1$, $r > (\beta + 1)/2$ be integers and $\xi = K_0(\log m)^{-1}$, with $K_0$ as
in proposition 8. There exists a discrete mixture $f(x) = \sum_{i=1}^N \alpha_i K_{\xi_i,\phi_i}(x - \mu_i)$ with $N \lesssim
(m\log m)^d$ and for all $i = 1,\dots,N$: $\mu_i \in [-2S,2S]^d$, $\xi_i \in [0, 2rK_0 m/\log m]^d$, $\phi_i \in [0,\pi/2]$;
such that $|f(x) - f_0(x)| \lesssim (\log m/m)^\beta$ for all $x \in [-S,S]^d$. Moreover $\sum_{i=1}^N |\alpha_i| \lesssim 1$, and
for any $i \ne j$ it holds $|\xi_i - \xi_j|_d \ge 2(\log m/m)^\beta$, $|\mu_i - \mu_j|_d \ge 2(\log m/m)^\beta$ and $|\phi_i - \phi_j| \ge
2(\log m/m)^\beta$.

Proof. We rewrite $L^\xi_{m,r}$ in a more convenient form for the sequel. Let $a_0 := 1$ and $a_k =
2(1 - k/m)$ for all $k = 1,\dots,m-1$. Then the first step is to notice that

\[
L^\xi_{m,r}(x) = m^{dr} A_{m,r}\,g(x) \prod_{i=1}^d \Big[ \sum_{k=0}^{m-1} a_k \cos(2\xi k x_i) \Big]^r .
\]

From here, letting $\mathcal{I}_r := \{0,\dots,m-1\}^r$ and $\mathcal{S} = \{-1,1\}$,

\[
L^\xi_{m,r}(x) = m^{dr} A_{m,r}\,g(x) \prod_{i=1}^d \Big[ \sum_{k \in \mathcal{I}_r} a'_k\,2^{-r} \sum_{e \in \mathcal{S}^r} \cos\Big(2\xi x_i \sum_{j=1}^r e_j k_j\Big) \Big],
\]

where $a'_k := a_{k_1} \cdots a_{k_r}$, and because $\prod_{j=1}^r \cos(2\xi k_j x_i) = 2^{-r} \sum_{e \in \mathcal{S}^r} \cos(2\xi \sum_{j=1}^r e_j k_j x_i)$.
Notice that $|a'_k|\,2^{-r} \le 1$ for all $k \in \mathcal{I}_r$, and that $2|\sum_{j=1}^r e_j k_j|$ can take at most $1 + r(m-1)$
values; we denote these unique values $\omega_j$ with $j \in \mathcal{J} := \{0,\dots,r(m-1)\}$. Then we can
rewrite,

\[
L^\xi_{m,r}(x) = m^{dr} A_{m,r}\,g(x) \prod_{i=1}^d \Big[ \sum_{k \in \mathcal{J}} a''_k \cos(\xi\omega_k x_i) \Big],
\]

where the coefficients $a''_k$ satisfy $|a''_k| \le 2\#(\mathcal{I}_r \times \mathcal{S}^r) \le 2(2m)^r$. Finally, for all $k \in \mathcal{J}^d$ letting
$b_k := 2^{-d} a''_{k_1} \cdots a''_{k_d}$ and $\omega_{k,i} := \omega_{k_i}$, with the same arguments as previously,

\[
L^\xi_{m,r}(x) = m^{dr} A_{m,r}\,g(x) \sum_{k \in \mathcal{J}^d} \sum_{e \in \mathcal{S}^d} b_k \cos\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i x_i\Big),
\]

where $|b_k| \le (2m)^{dr}$ for all $k \in \mathcal{J}^d$. Therefore,

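\[
\begin{aligned}
(m^{dr} A_{m,r})^{-1} L^\xi_{m,r} * f_\beta(x)
&= \sum_{k \in \mathcal{J}^d} \sum_{e \in \mathcal{S}^d} b_k \int_{\mathbb{R}^d} f_\beta(y)\,g(x - y) \cos\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i (x_i - y_i)\Big)\,dy \\
&= \sum_{k \in \mathcal{J}^d} \sum_{e \in \mathcal{S}^d} b_k \cos\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i x_i\Big) \int_{\mathbb{R}^d} f_\beta(y)\,g(x - y) \cos\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i y_i\Big)\,dy \\
&\quad + \sum_{k \in \mathcal{J}^d} \sum_{e \in \mathcal{S}^d} b_k \sin\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i x_i\Big) \int_{\mathbb{R}^d} f_\beta(y)\,g(x - y) \sin\Big(\xi \sum_{i=1}^d \omega_{k,i} e_i y_i\Big)\,dy .
\end{aligned}
\]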

We finish the proof by discretizing the integrals in the last equation. Obviously the proofs are
identical for both integrals, hence we only consider the first one. To ease notation, we set
$h_k(x) := f_\beta(x) \cos(\xi \sum_{i=1}^d \omega_{k,i} e_i x_i)$. For any integer $q \ge 1$, proceed as in the proof of
proposition 6 to find a signed measure $P_{k,q} =: \sum_{l \in \mathcal{L}} p_{k,l}\,\delta_{x_{k,l}}$ such that $\int_{[-2S,2S]^d} R(x)\,dP_{k,q}(x) =
\int_{[-2S,2S]^d} R(x)\,h_k(x)\,dx$ for all polynomials $R(x)$ of degree at most $q$, with $\#\mathcal{L} \le (q+d)!/(q!\,d!)$ and
$\sum_{l \in \mathcal{L}} |p_{k,l}| = \int_{[-2S,2S]^d} |h_k(x)|\,dx \le M$ for a positive constant $M$ (recall that by construction
of $f_\beta$, we have $\|f_\beta\|_\infty < +\infty$, and $\operatorname{supp} f_\beta \subseteq [-2S,2S]^d$). Then for any $x \in \mathbb{R}^d$,



\[
\begin{aligned}
\Big| \int_{[-2S,2S]^d} h_k(y)\,g(x - y)\,dy - \int_{[-2S,2S]^d} g(x - y)\,dP_{k,q}(y) \Big|
&\le \sum_{|\alpha| < q} \frac{|D^\alpha g(0)|}{\alpha!} \Big| \int_{[-2S,2S]^d} (x - y)^\alpha h_k(y)\,dy - \int_{[-2S,2S]^d} (x - y)^\alpha\,dP_{k,q}(y) \Big| \\
&\quad + \int_{[-2S,2S]^d} |R_q(y)|\,|h_k(y)|\,dy + \int_{[-2S,2S]^d} |R_q(y)|\,d|P_{k,q}|(y),
\end{aligned} \tag{28}
\]

where $|R_q(y)| \le \sup_{|\alpha| = q} |D^\alpha g(0)|\,|y|_d^q/q!$. The first term of the rhs of equation (28) is null by
construction of $P_{k,q}$. As in the proof of proposition 6, the two last terms of equation (28) are
bounded by a constant multiple of

\[
\exp\big\{-(1 - \gamma)\,q\log q + q(1 + \log(2\sqrt{d}S))\big\}.
\]

Then the error of approximating the integrals is o(m if $q = K_1\log m$ for a suitable constant
$K_1 > 0$ depending only on $\beta$ and $\gamma$. Since for $\xi = K_0(\log m)^{-1}$ we have,

mdr ^m,r ^ ^ IM 1$ mdr x ™r d ( 2r ~ 1 ' > x (\ogm)~ d x ffj d x m dr < (log m)~ d , 

k£j d e£S d 

the error of approximating $L^\xi_{m,r} * f_\beta$ by the discretized version does not exceed $o(m^{-\beta})$ when
$q = K_1\log m$. The conclusion of the proposition follows from elementary manipulation of
trigonometric functions and because $\#\mathcal{L} \lesssim q^d \lesssim (\log m)^d$.

It remains to prove the separation between the atoms of the mixing measure, but this
follows from proposition 13 with the same argument as in proposition 6. □
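
The proof rests on two trigonometric identities: the Fejér-type expansion $\sin^2(m\theta)/\sin^2(\theta) = m\sum_{k=0}^{m-1} a_k\cos(2k\theta)$ with $a_0 = 1$, $a_k = 2(1 - k/m)$, and the product-to-sum formula $\prod_{j=1}^r \cos\theta_j = 2^{-r}\sum_{e \in \{-1,1\}^r} \cos(\sum_j e_j\theta_j)$. The sketch below checks both numerically; the function names and test angles are illustrative.

```python
import itertools
import math


def fejer_lhs(m: int, theta: float) -> float:
    """sin^2(m * theta) / sin^2(theta)."""
    return (math.sin(m * theta) / math.sin(theta)) ** 2


def fejer_rhs(m: int, theta: float) -> float:
    """m * sum_{k=0}^{m-1} a_k cos(2 k theta), with a_0 = 1 and a_k = 2 (1 - k/m)."""
    total = 1.0
    for k in range(1, m):
        total += 2.0 * (1.0 - k / m) * math.cos(2 * k * theta)
    return m * total


def cos_product(thetas) -> float:
    """Product of cos(theta_j)."""
    return math.prod(math.cos(t) for t in thetas)


def cos_sign_sum(thetas) -> float:
    """2^{-r} times the sum over sign vectors e in {-1,1}^r of cos(sum_j e_j theta_j)."""
    r = len(thetas)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=r):
        total += math.cos(sum(e * t for e, t in zip(signs, thetas)))
    return total / 2 ** r
```

Raising the first identity to the power $r$ and taking the product over the $d$ coordinates gives the factor $m^{dr}$, and the second identity is what turns the resulting products of cosines into a single sum of cosines with frequencies $\omega_j$.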

9.3. Kullback-Leibler condition. 

Lemma 8. Let $f_0 \in C^\beta[-S,S]^d$. Then there exists a constant $C > 0$, not depending on $n$,
such that $\Pi(K_n(\theta_0, \varepsilon_n)) \ge \exp(-n\varepsilon_n^2)$ for $\varepsilon_n^2 = C n^{-2\beta/(2\beta+d)} (\log n)^{2\beta(2d+1)/(2\beta+d)}$.

Proof. Let $f_m(x) = \sum_{i=1}^N \alpha_i K_{\xi_i,\phi_i}(x - \mu_i)$ be as in proposition 9. For any $i = 1,\dots,N$ define
the sets $U_i := \{\xi \in \mathbb{R}^d : |\xi - \xi_i|_d \le (\log m/m)^\beta\}$, $V_i := \{\mu \in [-2S,2S]^d : |\mu - \mu_i|_d \le
(\log m/m)^\beta\}$ and $W_i := \{\phi \in [0,\pi/2] : |\phi - \phi_i| \le (\log m/m)^\beta\}$. Notice that these sets are
disjoint, and for any $i = 1,\dots,N$ we have

\[
\alpha F(U_i \times V_i \times W_i) \gtrsim |\xi_i|_d^{-d a_{12}} (\log m/m)^{\beta(a_1 + a_{10} + a_{13})} \gtrsim (\log m/m)^q,
\]

where $q := d a_{12} + \beta(a_1 + a_{10} + a_{13})$. Then proceed as in lemma 5, to find constants $K_1, K_4 > 0$
such that with $\varepsilon_n = C_0^{-1/2} K_1 (\log m/m)^\beta$,

\[
\Pi(K_n(\theta_0, \varepsilon_n)) \ge \exp\big\{-K_4 m^d (\log m)^{d+1}\big\}. \qquad □
\]
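
The exponents in lemma 8 can be recovered by balancing $n\varepsilon_n^2 \asymp n(\log m/m)^{2\beta}$ against $m^d(\log m)^{d+1}$ and treating $\log m \asymp \log n$. The sketch below does this bookkeeping with exact fractions; the function name is illustrative and constants are ignored.

```python
from fractions import Fraction


def balance_lemma8(beta, d):
    """Return (p, q) such that eps_n^2 is of order n**(-p) * (log n)**q.

    Balancing n * L**(2*beta) * m**(-2*beta) = m**d * L**(d+1), with L = log m,
    gives m**(2*beta + d) = n * L**(2*beta - d - 1).
    """
    beta, d = Fraction(beta), Fraction(d)
    m_pow_n = 1 / (2 * beta + d)                       # exponent of n in m
    m_pow_log = (2 * beta - d - 1) / (2 * beta + d)    # exponent of L in m
    # eps_n^2 = L**(2*beta) * m**(-2*beta)
    p = 2 * beta * m_pow_n
    q = 2 * beta - 2 * beta * m_pow_log
    return p, q


def closed_form_lemma8(beta, d):
    """Exponents as stated in lemma 8."""
    beta, d = Fraction(beta), Fraction(d)
    return 2 * beta / (2 * beta + d), 2 * beta * (2 * d + 1) / (2 * beta + d)
```

For instance, $\beta = 2$ and $d = 1$ give $\varepsilon_n^2 \asymp n^{-4/5}(\log n)^{12/5}$.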

Appendix A. Symmetric Gamma distribution 


The symmetric Gamma distribution $\mathrm{SGa}(a, b)$, with $a, b > 0$, is the distribution having
Fourier transform $t \mapsto (1 + t^2/b^2)^{-a}$. It is easily seen that if $X \sim \mathrm{Ga}(a, b)$ and $Y \sim \mathrm{Ga}(a, b)$,
with $X$ and $Y$ independent, then $X - Y$ has $\mathrm{SGa}(a, b)$ distribution.


Proposition 10. Let $Z \sim \mathrm{SGa}(a, b)$. Then for any positive integer $n$,

\[
E Z^{2n} = \frac{(2n)!}{n!}\,\frac{(a)^{(n)}}{b^{2n}}, \qquad E Z^{2n+1} = 0 .
\]

Moreover, the distribution $\mathrm{SGa}(a, b)$ is determined by its moments (in the sense that $\mathrm{SGa}(a, b)$
is the only distribution with this sequence of moments).


Proof. From the definition of $\mathrm{SGa}(a, b)$, the random variable $Z$ is distributed as $X - Y$, where
$X, Y \sim \mathrm{Ga}(a, b)$ and $X, Y$ are independent. Then it is obvious that all odd moments must
vanish. For the even moments, we write,

\[
E Z^{2n} = \frac{1}{b^{2n}} \sum_{k=0}^{2n} \binom{2n}{k} (-1)^k\,(a)^{(k)} (a)^{(2n-k)} = \frac{(2n)!}{n!}\,\frac{(a)^{(n)}}{b^{2n}},
\]

where the last equality can be obtained after some algebra. To see that $\mathrm{SGa}(a, b)$ is
determined by its moments, we check that Carleman's criterion applies (Gut, 2006), which is
straightforward. □
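
The "some algebra" step can be checked exactly for small $n$. The sketch below assumes that $(a)^{(n)}$ denotes the rising factorial $a(a+1)\cdots(a+n-1)$ and that $\mathrm{Ga}(a, b)$ is parametrized by shape $a$ and rate $b$, so that $E X^k = (a)^{(k)}/b^k$; both readings are consistent with the Fourier transform given above, and the function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def rising(a: Fraction, n: int) -> Fraction:
    """Rising factorial a (a+1) ... (a+n-1), with rising(a, 0) = 1."""
    out = Fraction(1)
    for i in range(n):
        out *= a + i
    return out


def even_moment_by_expansion(a, b, n: int) -> Fraction:
    """E (X - Y)^(2n) for independent X, Y ~ Ga(a, b), via the binomial
    expansion and the Gamma moments E X^k = rising(a, k) / b^k."""
    a, b = Fraction(a), Fraction(b)
    total = sum(
        (-1) ** k * comb(2 * n, k) * rising(a, k) * rising(a, 2 * n - k)
        for k in range(2 * n + 1)
    )
    return total / b ** (2 * n)


def even_moment_closed_form(a, b, n: int) -> Fraction:
    """(2n)! / n! * rising(a, n) / b^(2n)."""
    a, b = Fraction(a), Fraction(b)
    return Fraction(factorial(2 * n), factorial(n)) * rising(a, n) / b ** (2 * n)
```

For $n = 1$ both expressions reduce to $2a/b^2$, the variance of the difference of two independent $\mathrm{Ga}(a, b)$ variables.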


Proposition 11. Let $X \sim \mathrm{SGa}(a, \eta)$, with $0 < a < 1$ and $\eta > 0$. Then there is a constant
$C > 0$ such that for any $x \in \mathbb{R}$ and any $0 < \delta < (3 + \eta)^{-1}$ we have $\Pr(|X - x| \le \delta) \ge
C\delta e^{-(3+\eta)|x|}\,\Gamma(a)^{-1}$.


Proof. Assume for instance that $x \ge 0$. Recalling that $X$ is distributed as the difference of
two independent $\mathrm{Ga}(a, \eta)$ distributed random variables, it follows

\[
\Pr(|X - x| \le \delta) \ge \frac{1}{\Gamma(a)} \int_0^\infty y^{a-1} e^{-\eta y}\,\frac{1}{\Gamma(a)} \int_{x+y}^{x+y+\delta} z^{a-1} e^{-\eta z}\,dz\,dy .
\]

Because $a < 1$, the mapping $z \mapsto z^{a-1} e^{-\eta z}$ is monotonically decreasing on $\mathbb{R}_+$, then the last
integral in the rhs of the previous equation is lower bounded by $\delta(x + y + \delta)^{a-1} e^{-\eta(x+y+\delta)} \ge
\delta e^{-(3+\eta)(x+y+\delta)}$. Then

\[
\Pr(|X - x| \le \delta) \ge \frac{\delta e^{-(3+\eta)(x+\delta)}}{\Gamma(a)^2} \int_0^\infty y^{a-1} e^{-(3+2\eta)y}\,dy
= \frac{\delta e^{-(3+\eta)(x+\delta)}}{(3 + 2\eta)^a\,\Gamma(a)} \ge \frac{\delta e^{-(3+\eta)|x|}}{e\,(3 + 2\eta)^a\,\Gamma(a)} .
\]

The proof when $x < 0$ is obvious. □
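
Two elementary facts are used without comment in this proof; the following sketch spells them out. First, for $0 < a < 1$ and $t > 0$,

\[
t^{a-1} \ge \min(1, t^{-1}) \ge e^{-3t},
\]

since $t^{a-1} \ge 1 \ge e^{-3t}$ when $t \le 1$, while for $t > 1$ we have $t^{a-1} \ge t^{-1}$ and $t e^{-3t} \le (3e)^{-1} < 1$. Applied with $t = x + y + \delta$ this gives the factor $e^{-(3+\eta)(x+y+\delta)}$. Second, the restriction $\delta < (3+\eta)^{-1}$ yields

\[
e^{-(3+\eta)\delta} \ge e^{-1},
\]

which accounts for the factor $e$ in the last denominator.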


Appendix B. Auxiliary results 

Proposition 12. Let $K_A(x) = g(A^{-1}x)$, and assume that for all multi-index $k \in \mathbb{N}^d$ with
$|k| = 0, 1, 2$ the mapping $x \mapsto x^k g(x)$ belongs to $L^1(\mathbb{R}^d)$. Let $\|\cdot\|$ be the spectral norm on $\mathcal{E}$.
Then there is a constant $C > 0$ such that for all $x, \mu_1, \mu_2 \in \mathbb{R}^d$ and all $A_1, A_2 \in \mathcal{E}$ arbitrary
with $\|I - A_1^{-1}A_2\| \wedge \|I - A_2^{-1}A_1\|$ small enough,

\[
|K_{A_1}(x - \mu_1) - K_{A_2}(x - \mu_2)| \le C\|I - A_1^{-1}A_2\| \wedge C\|I - A_2^{-1}A_1\|
+ C\big(\|A_1^{-1}\| \wedge \|A_2^{-1}\|\big)\,|\mu_1 - \mu_2|_d .
\]

Proof. Starting from the triangle inequality, we have

\[
|K_{A_1}(x - \mu_1) - K_{A_2}(x - \mu_2)| \le |K_{A_1}(x - \mu_2) - K_{A_2}(x - \mu_2)|
+ |K_{A_1}(x - \mu_1) - K_{A_1}(x - \mu_2)| . \tag{29}
\]

We recall that $K_A(x) := g(A^{-1}x)$. To bound the first term, it is enough to bound $g(x) -
g(A_1^{-1}A_2 x)$ for all $x \in \mathbb{R}^d$. Let $(B_n)_{n \in \mathbb{N}}$ and $(C_n)_{n \in \mathbb{N}}$ be two arbitrary sequences in $\mathcal{E}$ such
that $\|I - B_n^{-1}C_n\| \le 1/n$, and let $\hat g$ denote the Fourier transform of $g$. Then,

\[
\sup_{x \in \mathbb{R}^d} |g(x) - g(B_n^{-1}C_n x)| \le \int_{\mathbb{R}^d} \big| \hat g(\xi) - |\det(B_n^{-1}C_n)|\,\hat g(B_n^{-1}C_n \xi) \big|\,d\xi .
\]

Remark that $|\det B_n^{-1}C_n| \le 1 + |\det(I - B_n^{-1}C_n)|$, and $\|I - B_n^{-1}C_n\| \le 1/n$ implies that
$|\det(I - B_n^{-1}C_n)| \le 1/n^d$. Also, $|B_n^{-1}C_n \xi|_d \le \|I - B_n^{-1}C_n\|\,|\xi|_d + |\xi|_d \le (1 + 1/n)|\xi|_d$. It
turns out that,

\[
\lim_{n \to \infty} |\det B_n^{-1}C_n|\,\hat g(B_n^{-1}C_n \xi) = \hat g(\xi) .
\]

We now prove that $\{|\det B_n^{-1}C_n|\,\hat g(B_n^{-1}C_n \xi) : n \ge 2\}$ is dominated. By assumption, $g \in
L^1(\mathbb{R}^d)$, as well as $x \mapsto x^k g(x)$ with $|k| = 1, 2$. This implies that $|\hat g(\xi)| \le C(1 + |\xi|_d)^{-2}$ for some
$C > 0$. We already saw that $|\det B_n^{-1}C_n| \le 1 + 1/n^d$, and $|\xi|_d \le |B_n^{-1}C_n \xi|_d + |(I - B_n^{-1}C_n)\xi|_d$
implies $|B_n^{-1}C_n \xi|_d \ge (1 - 1/n)|\xi|_d$. Therefore, for any $n \ge 2$,

\[
|\det B_n^{-1}C_n|\,|\hat g(B_n^{-1}C_n \xi)| \le \frac{C\,|\det B_n^{-1}C_n|}{(1 + |B_n^{-1}C_n \xi|_d)^2} \le \frac{C(1 + 2^{-d})}{(1 + |\xi|_d/2)^2} .
\]

Then the dominated convergence applies, and

\[
\lim_{n \to \infty} \sup_{x \in \mathbb{R}^d} |g(x) - g(B_n^{-1}C_n x)| = 0 .
\]

The second term of the rhs of equation (29) is bounded above by $|A_1^{-1}(\mu_1 - \mu_2)|_d \le \|A_1^{-1}\|\,|\mu_1 -
\mu_2|_d$, using Lipschitz continuity of $g$. Using a symmetry argument, the conclusion of the
proposition follows. □


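Proposition 13. Let $K_{\xi,\phi}(x) = g(x)\cos(\sum_{i=1}^d \xi_i x_i + \phi)$, and assume that for all multi-index
$k \in \mathbb{N}^d$ with $|k| \le 1$ we have $\sup_{x \in \mathbb{R}^d} |x^k g(x)| < +\infty$ and $\sup_{x \in \mathbb{R}^d} |D^k g(x)| < +\infty$. Then there is a
constant $C > 0$ such that for all $x, \mu_1, \mu_2, \xi_1, \xi_2 \in \mathbb{R}^d$ and all $\phi_1, \phi_2 \in [0, \pi/2]$

\[
|K_{\xi_1,\phi_1}(x - \mu_1) - K_{\xi_2,\phi_2}(x - \mu_2)| \le C|\xi_1 - \xi_2|_d + C|\mu_1 - \mu_2|_d + C|\phi_1 - \phi_2| .
\]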


Proof. We write,

\[
\begin{aligned}
|K_{\xi_1,\phi_1}(x - \mu_1) - K_{\xi_2,\phi_2}(x - \mu_2)| &\le |K_{\xi_1,\phi_1}(x - \mu_1) - K_{\xi_1,\phi_1}(x - \mu_2)| \\
&\quad + |K_{\xi_1,\phi_1}(x - \mu_2) - K_{\xi_1,\phi_2}(x - \mu_2)| + |K_{\xi_1,\phi_2}(x - \mu_2) - K_{\xi_2,\phi_2}(x - \mu_2)| .
\end{aligned}
\]

Because $g$ has bounded first derivatives, it is Lipschitz continuous for some Lipschitz constant
$K > 0$; then the first term of the rhs is bounded above by $K|\mu_1 - \mu_2|_d$. With the same
argument, the second term is bounded by a constant multiple of $\|g\|_\infty |\phi_1 - \phi_2|$. The last term
of the rhs is easily bounded, because for all $x \in \mathbb{R}^d$:

\[
\begin{aligned}
|K_{\xi_1,\phi_2}(x) - K_{\xi_2,\phi_2}(x)| &\le \Big| \cos\Big(\sum_{i=1}^d \xi_{1,i} x_i + \phi_2\Big) - \cos\Big(\sum_{i=1}^d \xi_{2,i} x_i + \phi_2\Big) \Big|\,|g(x)| \\
&\le \sum_{i=1}^d |\xi_{1,i} x_i - \xi_{2,i} x_i|\,|g(x)| \\
&\le \Big( \sum_{i=1}^d |\xi_{1,i} - \xi_{2,i}|^2 \Big)^{1/2} \Big( \sum_{i=1}^d |x_i g(x)|^2 \Big)^{1/2},
\end{aligned}
\]

where the last line holds by Hölder's inequality. Then the conclusion follows because $x \mapsto x^k g(x)$ is
bounded for all $|k| = 1$. □

Proposition 14. Let $g(x) = \exp(-|x|_d^2/2)$. Then $\sup_{x \in \mathbb{R}^d} |D^\alpha g(x)| \le \exp(\tfrac{1}{2}|\alpha|\log|\alpha|)$ for
all $\alpha \in \mathbb{N}^d$.
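
In dimension one the derivatives of the Gaussian are $g^{(k)}(x) = (-1)^k \mathrm{He}_k(x) e^{-x^2/2}$, with $\mathrm{He}_k$ the probabilists' Hermite polynomials, so the bound can be inspected numerically. The sketch below is only a grid-based sanity check in $d = 1$, taking the exponent to be $\frac{1}{2}k\log k$; the function names, grid range and grid size are illustrative assumptions.

```python
import math


def hermite_he(k: int, x: float) -> float:
    """Probabilists' Hermite polynomial He_k(x), via He_{j+1} = x He_j - j He_{j-1}."""
    prev, cur = 1.0, x
    if k == 0:
        return prev
    for j in range(1, k):
        prev, cur = cur, x * cur - j * prev
    return cur


def sup_abs_derivative(k: int, grid_max: float = 10.0, n: int = 4001) -> float:
    """Grid approximation of sup_x |d^k/dx^k exp(-x^2/2)| = sup_x |He_k(x)| exp(-x^2/2)."""
    best = 0.0
    for i in range(n):
        x = -grid_max + 2.0 * grid_max * i / (n - 1)
        best = max(best, abs(hermite_he(k, x)) * math.exp(-0.5 * x * x))
    return best


def claimed_bound(k: int) -> float:
    """exp((1/2) k log k), i.e. k**(k/2), for k >= 1."""
    return math.exp(0.5 * k * math.log(k))
```

On this grid the supremum stays below the claimed bound for every order $k$ from 1 to 10, with a margin that grows with $k$.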


Proof. For any $\alpha \in \mathbb{N}^d$, let $k = |\alpha| = \sum_{i=1}^d \alpha_i$. When $k < 2$, the result is obvious. Now
assume that $k \ge 2$. By Fourier duality, we have for all $x \in \mathbb{R}^d$


£>“<?(*) I < 


d 

u a g{u)\du<2 k / 2 X[T 

1=1 


(^) 



. . a i + 1 


where the last inequality follows from Stirling's formula. Then it is clear that,


|O“ 9 (0)|<exp 


L i= 1 i=l 


log(l + Oj 


The result follows because for all $k \ge 2$ we have $\sum_{i=1}^d \alpha_i \log(1 + \alpha_i) \le \sum_{i=1}^d \alpha_i \log(1 + k) \le
(1/2 + \log k)\sum_{i=1}^d \alpha_i \le k/2 + k\log k$. □